Decoding SARS-CoV-2 Transmission and Evolution and Ramifications for COVID-19 Diagnosis, Vaccine, and Medicine.

Rui Wang1, Yuta Hozumi1, Changchuan Yin2, Guo-Wei Wei1,3,4.   

Abstract

Tremendous effort has been given to the development of diagnostic tests, preventive vaccines, and therapeutic medicines for coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Much of this development has been based on the reference genome collected on January 5, 2020. Based on the genotyping of 15 140 genome samples collected up to June 1, 2020, we report that SARS-CoV-2 has undergone 8309 single mutations, which can be clustered into six subtypes. We introduce the mutation ratio and the mutation h-index to characterize protein conservativeness and unveil that the SARS-CoV-2 envelope protein, main protease, and endoribonuclease protein are relatively conservative, while the SARS-CoV-2 nucleocapsid protein, spike protein, and papain-like protease are relatively nonconservative. In particular, we have identified mutations on 40% of nucleotides in the nucleocapsid gene at the population level, signaling potential impacts on the ongoing development of COVID-19 diagnosis, vaccines, and antibody and small-molecule drugs.

Year:  2020        PMID: 32530284      PMCID: PMC7318555          DOI: 10.1021/acs.jcim.0c00501

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   4.956


Introduction

The ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has posed serious threats to public health and the world economy since it was detected in Wuhan, China, in December 2019.[15] As of June 1, 2020, 6 057 853 cases of COVID-19 have been reported in more than 200 countries and territories, resulting in more than 371 166 deaths.[27] There have been no signs of slowing down or relief at this moment, partially because there are no specific anti-SARS-CoV-2 drugs or effective vaccines. SARS-CoV-2 is a positive-strand RNA virus that belongs to the beta coronavirus genus. The genomic information underpins the development of antiviral medical interventions, prophylactic vaccines, and viral diagnostic tests. The first SARS-CoV-2 genome was reported on January 5, 2020.[28] It has a genome size of 29.99 kb, which encodes multiple nonstructural and structural proteins. The leader sequence and ORF1ab encode nonstructural proteins for RNA replication and transcription. Among the various nonstructural proteins, the viral papain-like (PL) proteinase, main protease (or 3CL protease), RNA polymerase, and endoribonuclease are common targets in antiviral drug discovery. Yet, it typically takes more than ten years to bring an average drug to the market. The downstream regions of the genome encode structural proteins, including the spike (S) protein, envelope (E) protein, membrane (M) protein, and nucleocapsid (N) protein.
Notably, the S protein uses one of its two subunits to bind directly to the host receptor angiotensin-converting enzyme 2 (ACE2), enabling virus entry into host cells.[29] The N protein, one of the most abundant viral proteins, can bind to the RNA genome and is involved in replication, assembly, and host cellular response during viral infection.[13] As a virulence factor, the E protein is a small integral membrane protein that regulates cell stress response and apoptosis and promotes inflammation.[4] The structural proteins, especially the S protein, are candidate antigens for vaccine and antibody drug development. Safe and effective vaccines are urgently needed to prevent the spread of SARS-CoV-2. However, it typically takes over one year to design and test a new vaccine. Furthermore, replication in RNA viruses, such as Influenza A, is subject to errors,[14] with the exception of nidoviruses. Coronaviruses, a kind of nidovirus, have the ability to proofread their genomes during genetic replication and recombination.[6] Therefore, SARS-CoV-2 might not mutate as fast as Influenza A viruses do, but it still has heterogeneous and dynamic populations. The SARS-CoV-2 genome undergoes rapid mutations that are partially stimulated as a response to the challenging immunological environments arising from its transmission to COVID-19 patients of different races, ages, and medical conditions. A vaccine developed at one time may not be effective against infection by newly mutated virus isolates. An alarming fact is that many of these mutations may devastate the ongoing effort in the development of effective medicines, preventive vaccines, and diagnostic tests. Accurate identification of the antigens and their mutations represents the most important roadblock in developing effective vaccines against COVID-19. For example, different vaccines are needed for various geographic locations due to the predominant mutations in the corresponding regions.
In COVID-19 diagnosis, diagnostic kits are designed using two major methods: serological tests and molecular tests. Serological tests detect specific neutralizing antibodies from COVID-19 infections. Molecular diagnoses look for specific SARS-CoV-2 genes and usually rely on the polymerase chain reaction (PCR). Because the SARS-CoV-2 genome mutates quickly, genotyping analysis of SARS-CoV-2 may optimize the PCR primer design to detect SARS-CoV-2 reliably and to reduce the risk of false negatives caused by genome sequence variations. In addition, the genotyping analysis may also reveal highly conserved regions with very few mutations, which can be selected as target sequences for clinical diagnosis and effective drug therapy. The evolution pattern arising from the highly frequent mutations of SARS-CoV-2 can be observed on short time scales. In the early infection period (i.e., February 2020), the SARS-CoV-2 variants were clustered as S and L types.[23] Recent genotyping analysis reveals a large number of mutations in various essential genes encoding the S protein, the N protein, and the RNA polymerase in the SARS-CoV-2 population.[30] Monitoring the evolutionary patterns and spread dynamics of SARS-CoV-2 is of great importance for COVID-19 control and prevention. Mutations occur in many different ways. Some mutations occur randomly. Other mutations are enforced by host immune system surveillance, which induces viral responses. The most preserved mutations and viral evolution can be regarded as the result of a dynamic equilibrium between random perturbation, host cell defense, and viral fitness. Therefore, the faster and wider SARS-CoV-2 spreads, the more frequent and diverse the mutations will be. The tracking and analysis of COVID-19 dynamics, transmission, and spread are of paramount importance for winning the ongoing battle against COVID-19.
Genetic identification and characterization of the geographic distribution, intercontinental evolution, and global trends of SARS-CoV-2 are the most effective approaches for studying COVID-19 genomic epidemiology and offer the molecular foundation for region-specific SARS-CoV-2 vaccine design, drug discovery, and diagnostic development.[16] For example, different vaccines for the shell can be designed according to predominant mutations. This work provides the most comprehensive genotyping to date to reveal the transmission trajectory and spread dynamics of COVID-19. Based on genotyping 15 140 SARS-CoV-2 genomes from the world as of June 1, 2020, we trace the COVID-19 transmission pathways and analyze the distribution of the subtypes of SARS-CoV-2 across the world. We use K-means clustering to group SARS-CoV-2 mutations, which provides updated molecular information for the region-specific design of vaccines, drugs, and diagnostics. Our clustering results show that, globally, there are at least six distinct subtypes of SARS-CoV-2 genomes, whereas in the U.S. there are four significant SARS-CoV-2 genotypes. We introduce the mutation h-index and the mutation ratio to characterize conservative and nonconservative proteins and genes. We unveil unexpected nonconservative genes and proteins, which serves as a warning for the current development of diagnostic tests, preventive vaccines, and therapeutic medicines.

Results and Discussion

COVID-19 Evolution and Clustering

Tracking the SARS-CoV-2 transmission pathways and analyzing the spread dynamics are critical to the study of genomic epidemiology. Temporospatial clustering of the SARS-CoV-2 genotypes during transmission provides insights into diagnostic testing and vaccine development for disease control. In this work, we retrieve and genotype 15 140 SARS-CoV-2 isolates from the world as of June 1, 2020. There are 8309 single mutations in these 15 140 isolates. Based on these mutations, we classify the 15 140 genotyped isolates by K-means clustering and track their geographical distributions. The SARS-CoV-2 genotypes, represented as single nucleotide polymorphism (SNP) variants, are clustered into six groups in the world, including the U.S. In particular, the genotypes in the U.S. are further clustered into four groups. Table 1 lists the co-mutations with the highest number of descendants in the different clusters in the world. The optimal number of clusters is established using the Elbow method in the K-means clustering algorithm (Supporting Information).
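The clustering step described above can be illustrated with a minimal sketch. It assumes that each isolate is encoded as a binary vector marking the presence or absence of each SNP and that Euclidean distance is used; the actual feature encoding and settings are given in the Supporting Information and are not reproduced here. The function names (`kmeans`, `elbow_curve`) and the toy data are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Plain Lloyd's algorithm. Returns (labels, centroids, wcss)."""
    rng = random.Random(seed)
    # Initialise centroids from k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
    # Within-cluster sum of squares, the quantity inspected by the Elbow method.
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss

def elbow_curve(points, ks, seed=0):
    """WCSS for each candidate k; the 'elbow' is where the decrease flattens."""
    return [(k, kmeans(points, k, seed=seed)[2]) for k in ks]

# Toy data (hypothetical): columns are the SNP sites
# 3037C>T, 14408C>T, 23403A>G, 8782C>T, 28144T>C.
toy_isolates = [
    [1, 1, 1, 0, 0], [1, 1, 1, 0, 0], [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1], [0, 0, 0, 1, 1], [0, 0, 0, 1, 1],
]
```

On the toy data, `elbow_curve(toy_isolates, [1, 2, 3])` drops sharply from k = 1 to k = 2 and then stays flat, which is the pattern the Elbow method looks for when choosing the number of clusters.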
Table 1

Co-mutations with the Highest Number of Descendants in Six Distinct Clusters in the World

cluster  mutation sites                                                      number of descendants
I        [3037C>T, 14408C>T]                                                 10875
II       [3037C>T, 14408C>T, 23403A>G]                                       10830
III      [14408C>T]                                                          10923
IV       [3037C>T, 14408C>T, 23403A>G, 28881G>A, 28882G>A, 28883G>C]         3043
V        [3037C>T, 14408C>T, 23403A>G, 25563G>T]                             4632
VI       [8782C>T, 28144T>C]                                                 1722
The detailed distribution of the SNP variants from the world for each cluster is provided in the Supporting Information. The SNP variant clusters from 15 countries that have a high number of COVID-19 cases are listed in Table 2. The listed countries are the United States (US), Canada (CA), Australia (AU), United Kingdom (UK), Germany (DE), France (FR), Italy (IT), Russia (RU), China (CN), Japan (JP), Korea (KR), India (IN), Spain (ES), Saudi Arabia (SA), and Turkey (TR). The pie chart plot on the world map is shown in Figure 1, which was created with Highcharts (https://www.highcharts.com/maps/demo). Light blue, dark blue, green, red, purple, and yellow represent clusters I, II, III, IV, V, and VI, respectively. The color of the dominant cluster determines the base color of each country. The geographic distribution of the SNP variant clusters reflects the approximate transmission pathways and spread dynamics across the world. Several findings can be made from Table 2:
Table 2

Cluster Distributions of Samples from 15 Countries

country  cluster I  cluster II  cluster III  cluster IV  cluster V  cluster VI
US8443114881561813975
CA122917161941
AU16314941013514677
UK53987590815321193
DE10202138420
FR41851412820
IT262491700
RU1027110930
CN832151125
JP03682030
KR0028000
IN93691411030
ES27100742532
SA14319120
TR25324900
Figure 1

Pie chart plot of six distinct clusters in the world. The light blue, dark blue, green, red, purple, and yellow represent clusters I, II, III, IV, V, and VI, respectively. The base color of each country is decided by the color of the dominant cluster.

Subtypes from clusters III and IV are causing the epidemic in the Asian countries, including CN, JP, and KR. The subtypes of SARS-CoV-2 in cluster VI are not spreading in the European countries (UK, DE, FR, IT, RU). Subtypes from all six clusters can be found in CN, US, CA, AU, and ES. Among them, China initially had samples only in clusters III and VI, and its samples spread to other clusters after the middle of March 2020. The dominant subtypes of SARS-CoV-2 in the COVID-19 pandemic of the United States belong to all six clusters. The cluster analysis reveals that the Asian countries have three dominant subtype clusters: cluster III [14408C>T], cluster IV [3037C>T, 14408C>T, 23403A>G, 28881G>A, 28882G>A, 28883G>C], and cluster VI [8782C>T, 28144T>C]. Cluster III was detected in the early period of COVID-19 infection in China and other Asian countries. The S protein SNP mutation 23403A>G is prevalent in clusters II, IV, and V in European countries. This S protein mutation may have contributed to the wide spread of SARS-CoV-2 in European countries. Furthermore, we analyze the statistics of SNP variants located in the United States. In Table 3, we list the number of cases in four different clusters for the west coast states (Washington (WA), California (CA), Alaska (AK), and Oregon (OR)), the east coast cities and states (New York (NY), Washington, D.C. (DC), Pennsylvania (PA), Florida (FL), Massachusetts (MA), Maryland (MD), Virginia (VA)), and Wisconsin (WI), Minnesota (MN), Michigan (MI), Georgia (GA), Utah (UT), Connecticut (CT), Arizona (AZ), Idaho (ID), and Illinois (IL).
Table 4 lists the co-mutations with the highest number of descendants in the different clusters in the United States. Several findings on the genotypes of the clusters in the US are as follows:
Table 3

Cluster Distributions of Samples from 20 States and Cities in the United States

state  cluster A  cluster B  cluster C  cluster D
WA30480540355
CA885311282
AK150311
OR7440
NY3241566807
DC2117
PA3016
FL8234
MA8029
MD4035
VA6711965
WI150815157
MN29281957
MI112572
GA52112
UT139323
CT391114
AZ318435
ID31018
IL2381920
others1461339173
total13089704971812
Table 4

Mutation Sites with Highest Frequency in Each Cluster in the United States

cluster    mutation sites                                           number of descendants
cluster A  [3037C>T, 14408C>T]                                      10875
cluster B  [8782C>T, 18060C>T, 28144T>C]                            1127
cluster C  [11083G>T]                                               1646
cluster D  [241C>T, 3037C>T, 14408C>T, 23403A>G, 25563G>T]          4494
The subtypes of SARS-CoV-2 in all of the clusters are spreading among the west coast states. In particular, the state of Washington is dominated by cluster B. East coast states are dominated by subtypes from clusters A and C, especially New York. The subtypes of SARS-CoV-2 in cluster A are spread throughout the United States. Figure 2 is the pie chart plot of the four distinct clusters in the US, which was also created with Highcharts. Blue, red, yellow, and green represent clusters A, B, C, and D, respectively. The base color of each state corresponds to its dominant cluster. We note that cluster D in the U.S. is derived from cluster V in the world, with an additional mutation at position 241 in the leader sequence. The high spread in New York is consistent with the high transmission of SARS-CoV-2 in European countries, where the subtype in cluster V is predominant.
Figure 2

Pie chart plot of four distinct clusters in the US. The blue, red, yellow, and green colors represent clusters A, B, C, and D. The base color of each state corresponds to its dominant cluster.

Ramifications for COVID-19 Diagnosis, Vaccine, and Medicine

Protein-Specific Mutation Analysis

Figures 3 and 4 depict the distribution and frequencies of SNP mutations of SARS-CoV-2 isolates from 15 140 genome samples in the world with respect to the reference genome of January 5, 2020. The statistics of single mutations on various SARS-CoV-2 proteins that occurred in the recorded genomes between January 5, 2020, and June 1, 2020, are listed in Table 5. The spike protein has the highest number of mutations on its gene, 1004, while the envelope protein has the lowest, 52. Since the sizes of the proteins vary dramatically, from 1273 residues for the spike protein to 75 residues for the envelope protein, it is useful to consider the mutation ratio, i.e., the number of mutations divided by the sequence length (nucleotides for MRGene, residues for MRPro). In this category, the RNA-dependent RNA polymerase has the lowest MRGene of 0.217, whereas the nucleocapsid protein has the highest MRGene of 0.400, i.e., 503 mutations on its 1257 nucleotides (419 residues). Note that the main protease has the second-lowest MRGene of 0.221, indicating its conservative nature. Another relatively conservative protein judged by the gene-level mutation ratio is the envelope protein, with MRGene = 0.231.
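As a worked check of the gene-level mutation ratio, the short sketch below recomputes MRGene from the gene lengths and NMGene counts given in Table 5 for the proteins discussed in the paragraph above. The function and variable names are illustrative choices, not part of the original analysis.

```python
def mutation_ratio(num_mutations, length):
    """Number of distinct single mutations divided by the sequence length."""
    return num_mutations / length

# (gene length in nucleotides, NMGene) as reported in Table 5.
table5_gene = {
    "spike protein": (3819, 1004),
    "main protease": (918, 203),
    "RNA polymerase": (2796, 607),
    "envelope (E) protein": (225, 52),
    "nucleocapsid (N) protein": (1257, 503),
}

# MRGene rounded to three decimals, as in the table.
mr_gene = {
    name: round(mutation_ratio(n_mut, gene_len), 3)
    for name, (gene_len, n_mut) in table5_gene.items()
}

# Proteins ordered from most to least conservative by MRGene.
ranking = sorted(mr_gene, key=mr_gene.get)
```

With these inputs, `ranking` starts with the RNA polymerase and ends with the nucleocapsid protein, matching the ordering stated in the text.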
Figure 3

Distribution of SNP mutations of SARS-CoV-2 isolates from 15 140 genome samples in the world with respect to the reference genome of January 5, 2020 (GenBank accession number: NC_045512.2).

Figure 4

Frequencies of the single SNP mutations of SARS-CoV-2 on the genome samples in the world with respect to the reference genome of January 5, 2020 (GenBank accession number: NC_045512.2).

Table 5

Protein-Specific and Gene-Specific Statistics of SARS-CoV-2 Single Mutations(a)

protein                   gene length  protein length  NMGene  NMPro  MRGene  MRPro  mutation h-index
spike protein             3819         1273            1004    391    0.263   0.307  26
main protease             918          306             203     78     0.221   0.255  16
papain-like protease      945          315             187     105    0.255   0.333  10
RNA polymerase            2796         932             607     228    0.217   0.245  21
endoribonuclease          1038         346             256     110    0.247   0.318  12
envelope (E) protein      225          75              52      23     0.231   0.307  9
membrane protein          666          222             165     60     0.248   0.270  14
nucleocapsid (N) protein  1257         419             503     205    0.400   0.489  33

(a) NMGene and NMPro are the number of mutations in terms of gene and protein, respectively. MRGene is the mutation ratio of the gene, and MRPro is the ratio of the nondegenerate mutations of a protein. The mutation h-index reported here is the gene-specific h-index.

Counting the number of single mutations and the mutation ratio does not reflect the fact that some mutations occur numerous times across genome samples while other mutations occur in only a few genome samples. To account for the frequency effect of mutations, we introduce a mutation h-index to measure both the number of mutations and the frequency of mutations of a given protein or genetic section. It is defined as the maximum value of h such that the given protein or genetic section has h single mutations that have each occurred at least h times. It is very interesting to note from Table 5 that the mutation h-index correlates very well with the number of mutations on the gene; the Pearson correlation coefficient is 0.711. Specifically, the N protein has both the highest MRGene of 0.400 and the highest h-index of 33, suggesting that it is the most nonconservative protein in SARS-CoV-2 genomes. In contrast, the envelope protein has the third-lowest MRGene of 0.231 and the lowest h-index of 9, indicating its relatively conservative nature. By combining the mutation ratio and the mutation h-index, we report that the most conservative SARS-CoV-2 protein is the envelope protein. The most nonconservative SARS-CoV-2 proteins are (1) the nucleocapsid protein, (2) the spike protein, and (3) the papain-like protease.
The number of mutations in terms of gene (NMGene) and the number of mutations in terms of protein (NMPro) that we report are accumulated over all of the 15 140 genome isolates. For a single genome isolate, the maximum number of mutations on the whole genome sequence is 24.
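The definition of the mutation h-index given above translates directly into a few lines of code. The sketch below is one straightforward way to compute it from a list of per-mutation occurrence counts; the function name and the example counts are hypothetical.

```python
def mutation_h_index(frequencies):
    """Largest h such that at least h mutations each occurred at least h times.

    `frequencies` holds one occurrence count per distinct single mutation
    in the protein or genetic section of interest.
    """
    h = 0
    # Walk the counts from most to least frequent; rank r qualifies
    # as long as the r-th most frequent mutation occurred >= r times.
    for rank, freq in enumerate(sorted(frequencies, reverse=True), start=1):
        if freq >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical counts for five distinct mutations in one gene.
example_counts = [10, 8, 5, 4, 3]
example_h = mutation_h_index(example_counts)
```

For `example_counts`, four mutations occurred at least four times each, but there are not five that occurred at least five times, so the h-index is 4.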

Diagnosis

Real-time RT-PCR (rRT-PCR) is routinely used for the qualitative detection of SARS-CoV-2 nucleic acid in COVID-19 diagnostic testing.[3,24] The primers used in rRT-PCR are critical for the precise diagnosis of COVID-19 and the discovery of new strains. The primer sequences are specially designed to amplify regions that are conserved across the existing strains, for high specificity and sensitivity, and they are affected by genotype changes as SARS-CoV-2 evolves. In COVID-19 diagnostic testing, many rRT-PCR primers are designed to detect three regions of SARS-CoV-2 that are perceived as conservative: (1) the RNA-dependent RNA polymerase (RdRP) gene in the ORF1ab region, (2) the E protein gene, and (3) the N protein gene.[3] Our genotyping statistics given in Table 5 indicate that the nucleocapsid protein is the worst choice. Among the four structural proteins of SARS-CoV-2, namely the spike surface glycoprotein (S) of 1273 amino acid residues, the nucleocapsid protein (N) of 419 amino acid residues, the membrane protein (M) of 222 amino acid residues, and the envelope protein (E) of 75 amino acid residues, the S protein is the most divergent, with 1004 unique mutations among the 15 140 SARS-CoV-2 genomes. The N protein has 503 unique mutations, and the E protein has 52 mutations. Considering the lengths of the proteins, all four structural proteins undergo many mutations. The RdRP gene, which is often used in COVID-19 diagnostic testing, also has 607 mutations. Therefore, all three regions targeted by routine rRT-PCR, namely the RdRP gene, the N protein gene, and the E protein gene, have significant mutations. Precise and robust diagnostic tools must be re-established according to the conserved regions and predominant mutations in the SARS-CoV-2 genomes detailed in the Supporting Information.
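One simple way to look for conserved primer targets, given the list of mutated positions from genotyping, is to search a gene for its longest stretch without any observed mutation. The sketch below illustrates that idea only; the function name, the coordinates, and the mutated positions are hypothetical, and real primer design involves further criteria (melting temperature, specificity, mutation frequency) that are not modeled here.

```python
def longest_conserved_window(start, end, mutated_positions):
    """Longest run of positions in [start, end] with no observed mutation.

    Returns (window_start, window_end), both inclusive. If every position
    is mutated, the returned window is empty (window_end < window_start).
    """
    best = (start, start - 1)  # empty window to begin with
    run_start = start
    inside = sorted(p for p in set(mutated_positions) if start <= p <= end)
    # A sentinel just past the end closes the final run.
    for pos in inside + [end + 1]:
        run_length = pos - run_start
        if run_length > best[1] - best[0] + 1:
            best = (run_start, pos - 1)
        run_start = pos + 1
    return best

# Hypothetical region spanning positions 100..120 with three mutated sites.
window = longest_conserved_window(100, 120, [105, 110, 118])
```

In this toy region the longest mutation-free stretch is positions 111 to 117, which would be the first candidate to examine for a primer binding site.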

Vaccine Development

Vaccines mostly target the S protein. Compared to SARS-CoV, SARS-CoV-2 has a unique furin cleavage site, where four amino acid residues (PRRA) are inserted into the S1–S2 junction region 681–684 of the S protein.[25] The furin cleavage site is crucial for zoonotic transmission of SARS-CoV-2.[7] This study reveals crucial mutations near the S1–S2 junction region in the S protein, including 23403A>G-(D614G), 23422C>T-(V620V), 23575C>T-(C671C), 23586A>G-(Q675R), 23611G>A-(R683R), 23707C>T-(P715P), 23731C>T-(T723T), 23849T>C-(L763L), and 23929C>T-(Y789Y). Moreover, these mutations of the SARS-CoV-2 S protein are located in the epitope region, corresponding to the regions 469–882 and 599–620 in SARS-CoV.[19] Additionally, many mutated amino acids are on the receptor-binding domain (RBD) of the S protein, as shown in Figure 5. Unfortunately, the S protein is the second most nonconservative protein in the genome based on the mutation ratio and the mutation h-index. In fact, about half of the receptor-binding domain residues of the S protein have had mutations in the past few months, as shown in Figure 6. Because the surface accessibility of an epitope is also important for the interaction of antibody and antigen, these mutations are critical for the antigenicity of the S protein.
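The SNP labels listed above pair a genome position with the residue it affects, for example 23403A>G with D614G. A minimal sketch of that mapping follows, assuming the S coding sequence starts at nucleotide 21563 of the reference genome NC_045512.2 (taken from the reference annotation, not stated in this article). The helper names are hypothetical.

```python
import re

# Assumption: first nucleotide of the S coding sequence in NC_045512.2.
S_GENE_START = 21563

SNP_PATTERN = re.compile(r"^(\d+)([ACGT])>([ACGT])$")

def parse_snp(snp):
    """Split a label such as '23403A>G' into (position, reference, variant)."""
    match = SNP_PATTERN.match(snp.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognised SNP label: {snp!r}")
    return int(match.group(1)), match.group(2), match.group(3)

def residue_number(position, gene_start=S_GENE_START):
    """1-based residue index of the codon containing a genome position."""
    offset = position - gene_start
    if offset < 0:
        raise ValueError("position lies upstream of the gene start")
    return offset // 3 + 1

# SNP positions quoted in the text, mapped to S protein residue numbers.
junction_snps = ["23403A>G", "23422C>T", "23575C>T", "23586A>G", "23611G>A",
                 "23707C>T", "23731C>T", "23849T>C", "23929C>T"]
junction_residues = [residue_number(parse_snp(s)[0]) for s in junction_snps]
```

Under that assumption, the computed residue numbers agree with the residue labels given in the text (614, 620, 671, 675, 683, 715, 723, 763, 789).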
Figure 5

Illustration of SARS-CoV-2 spike protein mutations using 6VXX as a template.

Figure 6

Illustration of SARS-CoV-2 spike-protein receptor-binding domain (RBD) mutations using 6M0J as a template. Nearly half of the residues in the RBD have undergone mutations in the past few months.

Convalescent COVID-19 patients show a neutralizing antibody response after infection, which is directed mostly against the S protein.[18] The neutralizing antibody responses against SARS-CoV-2 could give some defense against SARS-CoV-2 infection, thus having implications for preventing SARS-CoV-2 outbreaks. The divergence of S proteins and the nonconserved regions of the S proteins might contribute to the antigenicity. The highly frequent mutations identified in the S protein may reduce the durability of the immunity conferred by a SARS-CoV-2 vaccine or undermine the current development of vaccines. The existing mutations must be considered when designing a new vaccine. Additionally, a cocktail of multiple vaccines has a better chance of preventing COVID-19 infections.

Drug Discovery

Unfortunately, there is no specific effective drug for SARS-CoV-2 at this point. Potential drugs include small-molecule drugs and antibody drugs. Much of the effort in small-molecule drug discovery focuses on SARS-CoV-2 nonstructural proteins. Among the major nonstructural proteins of SARS-CoV-2, the main protease of 306 amino acids has 78 mutations with 0.255 mutations per residue and a mutation h-index of 16, the RNA polymerase of 932 amino acids has 228 mutations with 0.245 mutations per residue and a mutation h-index of 21, and the papain-like protease of 315 amino acids has 105 mutations with 0.333 mutations per residue and a mutation h-index of 10. In fact, the main protease is the most popular drug target because there are no similar known genes in the human genome, which implies that SARS-CoV-2 main protease inhibitors will likely be less toxic.[10] The present study suggests that the main protease is the second most conservative protein. Therefore, it remains the most attractive target for drug discovery. Therapeutic antibodies started with cancer treatments and are now applied to infectious diseases by targeting pathogens.[1] Antibody drugs are highly specific and versatile in the treatment of infectious diseases. Their working principle involves the host immune system. The time needed to develop antibody therapeutics is usually considerably shorter than that needed to develop a vaccine. Many SARS-CoV-2 antibody drugs are isolated from patient blood and target the S protein. Although there are many binding sites on the S protein that antibodies can target, the ones that are most effective in neutralizing SARS-CoV-2 block the receptor-binding domain (RBD), which binds the host cell angiotensin-converting enzyme 2 (ACE2) receptor. The RBD is a dongle-shaped domain at the end of the virus's spikes. As mentioned above, there are many mutations on the S protein. The RBD is also prone to mutations.
Mutations that break hydrogen bonds and/or salt bridges in antibody–antigen interactions will have a large impact. However, conservative mutations, such as those that replace hydrophobic residues with other hydrophobic residues, will typically have little effect. To avoid the failure of one specific antibody, cocktail treatments that include several different antibodies might be required to treat SARS-CoV-2 that undergoes antigenic mutations.

Protein-Specific Discussion

Spike Glycoprotein

The SARS-CoV-2 spike glycoprotein, or S protein, comprises two subunits, S1 and S2, of very different properties;[25] see Figure 5. Among them, the S1 subunit, as shown in Figure 6, contains the receptor-binding domain (RBD) responsible for binding to the host cell receptor angiotensin-converting enzyme 2 (ACE2). The RBD is also the common binding domain for antibodies. The S2 subunit offers the structural support of the S protein and mediates fusion between the viral and host cell membranes. After the fusion, the virus releases the viral genome into the host cell. The S1 RBD plays key parts in the induction of neutralizing-antibody and T-cell responses, as well as protective immunity. However, S2 and the extracellular domain (ECD) of the spike protein and their combination are commonly used in recombinant proteins in SARS-CoV-2 antibody development. As shown in Table 5, the S protein is the most heterogeneous structural protein, with a significant number of mutations, as shown in Figures 5 and 6 and Table 6. The divergence of the spike protein and the nonconserved regions of the spike protein might contribute to the antigenicity difference in SARS-CoV-2 isolates. We found that most of the high-frequency mutations of the S protein are located in the S1 subunit. Figure 6 indicates that nearly half of the amino acid residues have had mutations since January 5, 2020. One of the important mutations at S1 is 23010T>C (V483A), within the RBD for ACE2 binding; its total frequency is 23. The structural study revealed that the amino acids 442–487 in the S1 subunit may impact viral binding to human ACE2.[9,26] The mutations identified in this study imply changes in ACE2 binding affinity and in the transmissibility of SARS-CoV-2, as well as negative impacts on preventive vaccine and diagnostic test development.
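Frequency tables such as the top-10 lists in this section can be produced by counting, for each SNP, how many isolates carry it overall and within each cluster. The following is a minimal sketch of that tabulation under the assumption that each isolate is available as a pair of a cluster label and a set of SNP labels; the function names and the toy isolates are hypothetical.

```python
from collections import Counter, defaultdict

def snp_frequency_table(isolates):
    """Count isolates carrying each SNP, overall and per cluster.

    `isolates` is an iterable of (cluster_label, snps) pairs, where
    `snps` is a collection of SNP labels such as '23403A>G'.
    """
    total = Counter()
    per_cluster = defaultdict(Counter)
    for cluster, snps in isolates:
        for snp in set(snps):  # count each SNP once per isolate
            total[snp] += 1
            per_cluster[cluster][snp] += 1
    return total, per_cluster

def top_snps(isolates, n=10):
    """Rows of (SNP, total frequency, {cluster: frequency}) for the n most frequent SNPs."""
    total, per_cluster = snp_frequency_table(isolates)
    clusters = sorted(per_cluster)
    return [
        (snp, count, {c: per_cluster[c][snp] for c in clusters})
        for snp, count in total.most_common(n)
    ]

# Toy isolates (hypothetical cluster assignments and SNP sets).
toy = [
    ("II", {"23403A>G", "14408C>T"}),
    ("II", {"23403A>G"}),
    ("IV", {"23403A>G", "23731C>T"}),
    ("III", {"14408C>T"}),
]
rows = top_snps(toy, n=2)
```

Each returned row carries the same columns as the top-10 tables: the SNP, its total frequency, and its frequency in every cluster.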
Table 6

Top 10 High Frequency Single SNP Genotypes in the Spike Surface Glycoprotein of SARS-CoV-2

rank  SNP  protein mutation  total frequency  cluster I  cluster II  cluster III  cluster IV  cluster V  cluster VI
top123403A>GD614G109692333260970296529911
top223731C>TT723T228240120300
top323929C>TY789Y22820225100
top424368G>TD936Y11037012700
top521575C>TL5F9822928151410
top624862A>GT1100T90145801800
top724390G>CS943T5620728100
top824389A>CS943R5620728100
top924933G>TG1124V4715021713
top1023707C>TP715P441039040

Main Protease

SARS-CoV-2 main protease, or 3CL protease, is essential for cleaving the polyproteins that are translated from the viral RNA.[10] It operates at multiple cleavage sites on the large polyprotein through the proteolytic processing of replicase polyproteins and plays a pivotal role in viral gene expression and replication. SARS-CoV-2 main protease is one of the most attractive targets for anti-CoV drug design because its inhibition would block viral replication and its inhibitors are unlikely to be toxic, as there are no known similar human proteases. Another reason for the focused drug discovery efforts in developing SARS-CoV-2 main protease inhibitors is that this protein is relatively conservative, as shown in Table 5. Figure 7 illustrates the main protease mutation patterns. Figure 8 further highlights the inhibitor binding domain (BD). Indeed, the main protease is relatively conservative compared to the spike protein. Table 7 lists the top 10 mutations and their frequencies in our data set. It is interesting to see that many mutations, such as D176D, R298R, and N151N, are degenerate ones, i.e., they do not change the encoded amino acid. One possible explanation is that nondegenerate mutations are nonsilent and likely to cause unsurvivable disruption to the virus. Note that mutation G15S mostly occurs in cluster IV. Mutation R298R is restricted to cluster IV. Some other mutations, such as D248E, A266V, N151N, and T45I, are specific to certain clusters. Nonetheless, some mutations at the BD shown in Figure 8 are worth noting. They can undermine the ongoing drug discovery effort.
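Whether a single-nucleotide change is degenerate (the amino acid is unchanged, as in D176D) or nondegenerate (the amino acid changes, as in G15S) can be decided by translating the codon before and after the substitution with the standard genetic code. The sketch below shows this check; the function name is hypothetical and the example codons are illustrative rather than read from the reference genome.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_substitution(codon, index_in_codon, new_base):
    """Translate a codon before and after a single-base substitution.

    Returns (amino_acid_before, amino_acid_after, kind), where kind is
    'degenerate' if the amino acid is unchanged and 'non-degenerate' otherwise.
    """
    mutated = codon[:index_in_codon] + new_base + codon[index_in_codon + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    kind = "degenerate" if before == after else "non-degenerate"
    return before, after, kind

# Illustrative codons: a middle-base change that alters the residue,
# and a third-base change that does not.
changed = classify_substitution("GAT", 1, "G")    # GAT -> GGT
unchanged = classify_substitution("GAT", 2, "C")  # GAT -> GAC
```

Third-position changes are often degenerate because of codon redundancy, which is consistent with the many X-to-X entries (such as R298R) in the top-10 lists.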
Figure 7

Illustration of SARS-CoV-2 main protease mutations using 6LU7 as a template.[10]

Figure 8

Illustration of SARS-CoV-2 main protease binding domain (BD) mutations of 6LU7.

Table 7

Top 10 High Frequency Single SNP Genotypes in the Main Protease of SARS-CoV-2

rank  SNP  protein mutation  total frequency  cluster I  cluster II  cluster III  cluster IV  cluster V  cluster VI
top110097G>AG15S224230120000
top210323A>GK90R9587113111
top310798C>AD248E8844440000
top410851C>TA266V8625000610
top510582C>TD176D5320110310
top610319C>TL89F5028140170
top710948A>GR298R330003300
top810507C>TN151N3231217000
top910265G>AG71S313002800
top1010188C>TT45I272301030
Papain-like Protease

SARS-CoV-2 papain-like protease (PLPro) is a cysteine protease located within the nonstructural protein 3 (NSP3) section of the viral genome.[17] Like the main protease, PLPro activity is required to cleave the viral polyprotein into functional, mature subunits and thereby contributes to viral replication. Additionally, PLPro possesses a deubiquitinating activity. The SARS-CoV-2 PLPro is also a major therapeutic and diagnostic target. As shown in Table 5, the SARS-CoV-2 PLPro is prone to mutations. Figure 9 shows that mutations are spread all over PLPro. Table 8 lists the top 10 mutations in PLPro. Three of these mutations are degenerate ones. Note that only two of the top mutations occurred in cluster II. In contrast, cluster I has many different mutations.
Figure 9

Illustration of SARS-CoV-2 papain-like protease mutations using 6W9C as a template.

Table 8

Top 10 High Frequency Single SNP Genotypes in the Papain-Like Protease of SARS-CoV-2

rankSNPprotein mutationtotal frequencycluster Icluster IIcluster IIIcluster IVcluster Vcluster VI
top15142C>TT808I410041000
top25730C>TT1004I22304924
top35784C>TT1022I190002017
top45062G>TL781F151014000
top55467C>TY916Y151005000
top65183C>TP822S15213270
top75230G>TK837N12750000
top85572G>TM951I11009002
top95812C>TD1031D10105310
top105284C>TN855N10801100

RNA Polymerase

SARS-CoV-2 RNA-dependent RNA polymerase (RdRP) is an enzyme that catalyzes the synthesis of an RNA strand complementary to the SARS-CoV-2 RNA template and is thus essential to the replication of SARS-CoV-2 RNA.[8] As one of the nonstructural proteins, RdRP is encoded in the early part of the ORF1b section. As in most other RNA viruses, SARS-CoV-2 RdRP is considered to be highly conserved to maintain viral functions and is thus targeted in antiviral drug development as well as diagnostic tests. On the other hand, the SARS-CoV-2 RNA polymerase lacks proofreading capability, and thus its mutations are bound to happen, as shown in Table . Figure 10 illustrates the SARS-CoV-2 RdRP mutations since January 5, 2020. Surprisingly, there are many mutations in SARS-CoV-2 RdRP. Table 9 describes the top 10 mutations. As in other cases, five of these mutations are degenerate ones.
Figure 10

Illustration of SARS-CoV-2 RNA-polymerase mutations using 6M71 as a template.

Table 9

Top 10 High Frequency Single SNP Genotypes in the RNA Dependent Polymerase of SARS-CoV-2

rankSNPprotein mutationtotal frequencycluster Icluster IIcluster IIIcluster IVcluster Vcluster VI
top114408C>TP323L109252309260268295529910
top214805C>TY455Y12429012023001
top315324C>TN628N40512825318510
top413730C>TA97V2631120232000
top513536C>TY32Y12123019250
top613862C>TT141I11861532020
top714786C>TA449V98531432260
top815540C>TV700V391037100
top913627G>TD63Y360135000
top1014877C>TY479Y342021029

Endoribo-nuclease

Endoribo-nuclease (NendoU) protein is a nidoviral RNA uridylate-specific enzyme that cleaves RNA.[11] It contains a C-terminal catalytic domain belonging to the NendoU family of RNA-processing enzymes. The NendoU protein is present in coronaviruses, arteriviruses, and toroviruses. Many aspects of the detailed function and activity of SARS-CoV-2 NendoU protein are yet to be revealed. Figure 11 depicts SARS-CoV-2 NendoU protein mutations. As in most other SARS-CoV-2 proteins, mutations have occurred in different parts of the protein. Table shows that NendoU is relatively conservative. Table 10 lists the top 10 high-frequency mutations of the SARS-CoV-2 NendoU protein that occurred in the past few months. Four of these mutations are degenerate ones. The frequencies of these mutations range from 153 down to 15. Note that cluster VI has only one of these mutations.
Figure 11

Illustration of SARS-CoV-2 Endoribo-nuclease protein mutations using 6VWW as a template.

Table 10

Top 10 High Frequency Single SNP Genotypes in the Endoribo-nuclease of SARS-CoV-2

rankSNPprotein mutationtotal frequencycluster Icluster IIcluster IIIcluster IVcluster Vcluster VI
top119839T>CN73N15370014600
top219684G>TV22L632057400
top320578G>TV320L5942161000
top420134G>TV172L3910251030
top520148C>TF176F313120502
top619999G>TV127F3014001150
top720316C>TF232F250025000
top820270C>TA217V223019000
top920275G>AD219N201171010
top1020031C>AA137A151001500

Envelope Protein

The SARS-CoV-2 envelope (E) protein is one of the four structural proteins of SARS-CoV-2. As a transmembrane protein, it is involved in ion channel activity and thus facilitates viral assembly, budding, envelope formation, pathogenesis, and release of the virus.[22] The E protein may not be essential for viral replication, but it is essential for pathogenesis. Figure 12 illustrates the E protein as a very small pentamer with a few mutations. Table 11 shows its top 10 mutations. Note that the first four mutations are degenerate ones. All other mutations have relatively low frequencies. As shown in Table , the SARS-CoV-2 E protein is very conservative.
Figure 12

Illustration of SARS-CoV-2 envelope protein mutations using 5X29 as a template.

Table 11

Top 10 High Frequency Single SNP Genotypes in the Envelope (E) Protein of SARS-CoV-2

rankSNPprotein mutationtotal frequencycluster Icluster IIcluster IIIcluster IVcluster Vcluster VI
top126340C>TA32A160214000
top226256C>TF4F12260400
top326319A>TV25V10108001
top426319A>GV25V8107000
top526270C>TT9I7410200
top626416G>TV58F5101300
top726326C>TL28L5005000
top826314G>AV24M4000400
top926262G>AS6S4101020
top1026370C>TY42Y4103000

Nucleocapsid Protein

SARS-CoV-2 nucleocapsid (N) protein[2] is another structural protein. Its primary function is to encapsidate the viral genome. To do so, it is heavily phosphorylated (or charged) and can thereby bind RNA. Additionally, SARS-CoV-2 N protein tethers the viral genome to the replicase-transcriptase complex (RTC) and plays a crucial role in viral genome encapsulation. Therefore, it may function completely differently at different stages of the viral life cycle. In the literature, SARS-CoV-2 N protein is considered to be one of the most conservative SARS-CoV-2 proteins and is a popular target for diagnosis and vaccine development.[3] The present results, shown in Table , indicate that the SARS-CoV-2 N protein is the least suitable target for drug, vaccine, and diagnostic development. Figure 13 illustrates SARS-CoV-2 nucleocapsid phosphoprotein mutations using 6VYO as a template.
Figure 13

Illustration of SARS-CoV-2 nucleocapsid phosphoprotein mutations using 6VYO as a template.

Table 12 presents the top 10 mutations of the SARS-CoV-2 N protein since January 5, 2020. Note that only 2 out of the top 10 mutations are degenerate ones, which is a significantly lower ratio than that of other proteins. The frequency of the 10th mutation is 78, which suggests that there are many mutations associated with this medium-sized protein. Most top mutations occurred in clusters I, III, and IV. Clusters V and VI have almost none of the top 10 mutations.
Table 12

Top 10 High Frequency Single SNP Genotypes in the Nucleocapsid Phosphoprotein of SARS-CoV-2

rankSNPprotein mutationtotal frequencycluster Icluster IIcluster IIIcluster IVcluster Vcluster VI
top128881G>AR203K3083100117296311
top228882G>AR203K307696014296600
top328883G>CG204R307796114296600
top428311C>TP13L32313317101
top528657C>TD128D19112183302
top628688T>CL139L16311161000
top728836C>TS188L12064531020
top828878G>AS202N910091000
top928580G>TD103Y791137400
top1029148T>CI292T783117300

Membrane Protein

SARS-CoV-2 membrane (M) protein is another structural protein and plays a central role in viral assembly and viral particle formation. It exists as a dimer in the virion and adopts geometric shapes that enable membrane curvature and binding to nucleocapsid proteins. Similar to other SARS-CoV-2 proteins, M protein is also a popular target for viral diagnosis and vaccines. Table gives SARS-CoV-2 M protein the middle ranking for its conservation. Table 13 details the top 10 mutations in SARS-CoV-2 M protein that occurred in the past few months. Eight of these mutations are degenerate. Clusters I and V have relatively few of these mutations.
Table 13

Top 10 High Frequency Single SNP Genotypes in the Membrane Glycoprotein of SARS-CoV-2

rankSNPprotein mutationtotal frequencycluster Icluster IIcluster IIIcluster IVcluster Vcluster VI
top127046C>TT175M306141228900
top226530A>GD3G153411101001
top326729T>CA69A11900119000
top426951G>AV143V6421112390
top526750C>TI76I490104611
top626681C>TF53F267110710
top726864A>GP114P211047000
top826936C>TL138L170210113
top926873C>TN117N17423440
top1026625C>TL35L17800180

Materials and Methods

Data Collection and Preprocessing

On January 5, 2020, the complete genome sequence of SARS-CoV-2 was first released on GenBank (accession number: NC_045512.2) by Zhang’s group at Fudan University.[28] Since then, there has been a rapid accumulation of SARS-CoV-2 genome sequences. In this work, 15 140 complete, high-coverage genome sequences of SARS-CoV-2 strains from infected individuals around the world were downloaded from the GISAID database[20] (https://www.gisaid.org/) as of June 1, 2020. Records in GISAID without an exact submission date were excluded. To align the 15 140 complete genome sequences with the reference SARS-CoV-2 genome, multiple sequence alignment (MSA) was carried out using Clustal Omega[21] with default parameters.

SNP Genotyping

SNP genotyping measures the genetic variations between different members of a species. Establishing an SNP genotyping method for investigating genotype changes during the transmission and evolution of SARS-CoV-2 is of great importance. By analyzing the aligned genome sequences, SNP profiles can be constructed that record all of the SNPs in terms of the nucleotide changes and their corresponding positions. The SNP profile of a given genome from a COVID-19 patient captures all the differences from a complete reference genome sequence and can be considered as the genotype of the individual SARS-CoV-2.
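The construction of an SNP profile from an aligned sequence can be sketched as follows. This is a minimal illustration, assuming that the reference and the sample are two rows of the same alignment (equal length, with "-" marking gaps) and that only single-nucleotide substitutions are recorded; the function names and the toy sequences are hypothetical and do not reproduce the pipeline used in this work.

```python
def snp_profile(reference, sample):
    """Return the set of (position, ref_base, alt_base) substitutions.

    Positions are 1-based reference coordinates, so alignment columns where
    the reference has a gap do not advance the counter. Gaps and ambiguous
    bases (anything other than A, C, G, T) are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("aligned sequences must have equal length")
    valid = set("ACGT")
    profile = set()
    position = 0
    for ref_base, alt_base in zip(reference.upper(), sample.upper()):
        if ref_base == "-":
            continue  # insertion relative to the reference
        position += 1
        if ref_base in valid and alt_base in valid and ref_base != alt_base:
            profile.add((position, ref_base, alt_base))
    return profile


def format_snp(snp):
    """Render a substitution in the notation of the tables, e.g. '10097G>A'."""
    position, ref_base, alt_base = snp
    return f"{position}{ref_base}>{alt_base}"


# Toy alignment (hypothetical data): two substitutions and one insertion.
reference = "ACGTACGT-A"
sample = "ACATACGTTC"
profile = snp_profile(reference, sample)
print(sorted(format_snp(snp) for snp in profile))
```

Representing each profile as a set makes the set operations of the next subsection directly applicable.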

Distance of SNP Variants

The Jaccard distance measures dissimilarity between sample sets. The Jaccard distance of SNP variants is widely employed in the phylogenetic analysis of human or bacterial genomes.[30] In this work, we utilize the Jaccard distance to compare the SNP variant profiles of SARS-CoV-2 genomes. The Jaccard similarity coefficient, also known as the Jaccard index, is defined as the size of the intersection divided by the size of the union of two sets A, B:[12]

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$

The Jaccard distance of two sets A, B is scored as the difference between one and the Jaccard similarity coefficient and is a metric on the collection of all finite sets:

$$d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}.$$

Therefore, the genetic distance of two genomes corresponds to the Jaccard distance of their SNP variants. If A ∩ B ≠ ⌀, A ⊄ B, and B ⊄ A, then we say these two SNP variants are relatives. If A ⊂ B, then A is the ancestor of B and B is the descendant of A. In principle, the Jaccard distance measure of SNP variants takes account of the ordering of SNP positions, i.e., the transmission trajectory, when an appropriate reference sample is selected. However, one may fail to identify the infection pathways from the mutual Jaccard distances of multiple samples. In this case, the dates of the sample collections offer useful information. Additionally, clustering techniques, such as the K-means method described below, enable us to characterize the spread of COVID-19 in communities.
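The definitions in this subsection translate directly into set operations. The sketch below is illustrative only: the sample sets are hypothetical (the positions are borrowed from the tables above for readability), the function names are not from the original work, and the value returned for two empty profiles is a convention chosen here.

```python
def jaccard_distance(a, b):
    """One minus the ratio of intersection size to union size."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention of this sketch: two empty profiles are identical
    return 1.0 - len(a & b) / len(union)


def relation(a, b):
    """Classify two SNP variants following the terminology of the text."""
    a, b = set(a), set(b)
    if a == b:
        return "identical"
    if a < b:
        return "ancestor"    # a is a proper subset of b
    if b < a:
        return "descendant"  # b is a proper subset of a
    if a & b:
        return "relatives"   # overlapping, neither contains the other
    return "unrelated"


# Hypothetical SNP profiles given as sets of mutated positions.
s1 = {14408}
s2 = {14408, 28881}
s3 = {14408, 27046}

print(jaccard_distance(s1, s2), relation(s1, s2))
print(jaccard_distance(s2, s3), relation(s2, s3))
```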

K-Means Clustering

K-means clustering is one of the fundamental unsupervised algorithms in machine learning. It aims at partitioning a given data set $X = \{x_1, x_2, \ldots, x_n, \ldots, x_N\}$, $x_n \in \mathbb{R}^d$, into k clusters $\{C_1, C_2, \ldots, C_k\}$, $k \le N$, such that a specific clustering criterion is optimized. More specifically, the standard K-means clustering algorithm starts by picking k points as cluster centers randomly and then allocates each data point to its nearest cluster. The cluster centers are updated iteratively by minimizing the within-cluster sum of squares (WCSS), which is defined by

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_n \in C_j} \| x_n - \mu_j \|_2^2, \qquad \mu_j = \frac{1}{n_j} \sum_{x_n \in C_j} x_n,$$

where $\mu_j$ is the mean of the points located in the jth cluster $C_j$ and $n_j$ is the number of points in $C_j$. Here, $\| \cdot \|_2$ denotes the L2 distance. The algorithm above only provides a way to obtain the optimal partition for a fixed number of clusters. However, we are interested in finding the best number of clusters for the SNP variants. Therefore, the Elbow method is applied. By varying the number of clusters k, a set of WCSS values can be calculated in the K-means clustering process, and the WCSS can then be plotted against k. The location of the elbow in this plot is taken as the optimal number of clusters. Note that the WCSS measures the variability of the points within each cluster, which is influenced by the number of points N. Therefore, as the total number of points N increases, the value of WCSS becomes larger. Additionally, the performance of K-means clustering depends on the choice of the distance. In this work, we propose to implement K-means clustering with the Elbow method for analyzing the optimal number of subtypes of SARS-CoV-2 SNP variants. The Jaccard distance-based and location-based representations are considered as the input features for the K-means clustering method.
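The K-means iteration and the WCSS curve used by the Elbow method can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work: to keep the output reproducible, it seeds the centers with a deterministic farthest-point rule instead of the random start described above, and the two-dimensional toy points are hypothetical.

```python
def squared_l2(p, q):
    """Squared L2 distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic seeding (assumption of this sketch, not a random start)."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point whose nearest existing center is farthest away.
        centers.append(max(points, key=lambda p: min(squared_l2(p, c) for c in centers)))
    return [list(c) for c in centers]


def kmeans(points, k, max_iter=100):
    """Alternate assignment and mean update; return (labels, centers, wcss)."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: squared_l2(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(squared_l2(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wcss


# Toy data (hypothetical): three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]

wcss_by_k = {k: kmeans(points, k)[2] for k in range(1, 5)}
for k, value in wcss_by_k.items():
    print(k, round(value, 2))
```

For this toy data the WCSS drops sharply up to k = 3 and only marginally afterwards, which is the elbow one would read off the plot.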

Jaccard Distance-Based Representation

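Suppose we have a total of N SNP variants with respect to a reference genome in a SARS-CoV-2 sample. The locations of the mutation sites for each SNP variant are saved in the set $S_i$, i = 1, 2, ..., N. The Jaccard distance between two different sets (or samples) $S_i$, $S_j$ is denoted as $d_J(S_i, S_j)$. Therefore, the N × N Jaccard distance-based representation is

$$D_J = \begin{pmatrix} d_J(S_1, S_1) & d_J(S_1, S_2) & \cdots & d_J(S_1, S_N) \\ d_J(S_2, S_1) & d_J(S_2, S_2) & \cdots & d_J(S_2, S_N) \\ \vdots & \vdots & \ddots & \vdots \\ d_J(S_N, S_1) & d_J(S_N, S_2) & \cdots & d_J(S_N, S_N) \end{pmatrix}.$$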
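A minimal sketch of how such an N × N representation could be assembled is given below. The three profiles are hypothetical, the function names are not from the original work, and the symmetry of the Jaccard distance is used so that each pair is evaluated only once.

```python
def jaccard_distance(a, b):
    """One minus the ratio of intersection size to union size."""
    union = a | b
    if not union:
        return 0.0  # convention of this sketch for two empty profiles
    return 1.0 - len(a & b) / len(union)


def jaccard_matrix(profiles):
    """Row i holds the distances from profile i to every profile."""
    n = len(profiles)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(profiles[i], profiles[j])
            matrix[i][j] = d
            matrix[j][i] = d  # the distance is symmetric
    return matrix


# Hypothetical SNP profiles given as sets of mutated positions.
profiles = [{14408}, {14408, 28881, 28882, 28883}, {27046}]
for row in jaccard_matrix(profiles):
    print(row)
```

Each row of the matrix then serves as the feature vector of one sample in the clustering step.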

Location-Based Representation

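Suppose we have N SNP variants with respect to a reference genome in a SARS-CoV-2 sample. Among them, M different mutation sites can be counted. For the ith SNP variant, $V_i = [v_i^1, v_i^2, \ldots, v_i^M]$, i = 1, 2, ..., N, is a 1 × M vector which satisfies the following:

$$v_i^j = \begin{cases} 1, & \text{if the } j\text{th mutation site is mutated in the } i\text{th SNP variant}, \\ 0, & \text{otherwise}. \end{cases}$$

Therefore, an N × M location-based representation is

$$V = \begin{pmatrix} v_1^1 & v_1^2 & \cdots & v_1^M \\ v_2^1 & v_2^2 & \cdots & v_2^M \\ \vdots & \vdots & \ddots & \vdots \\ v_N^1 & v_N^2 & \cdots & v_N^M \end{pmatrix}.$$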
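The location-based representation amounts to a binary indicator matrix over the union of observed mutation sites. The following sketch assumes that each SNP variant is given as a set of mutated positions; the profiles are hypothetical and the function name is not from the original work.

```python
def location_matrix(profiles):
    """Return (sites, rows): the sorted mutation sites and one 0/1 row per variant."""
    sites = sorted(set().union(*profiles))
    column = {site: j for j, site in enumerate(sites)}
    rows = []
    for profile in profiles:
        row = [0] * len(sites)
        for site in profile:
            row[column[site]] = 1  # mark the sites mutated in this variant
        rows.append(row)
    return sites, rows


# Hypothetical SNP profiles given as sets of mutated positions.
profiles = [{14408}, {14408, 28881, 28882, 28883}, {27046}]
sites, rows = location_matrix(profiles)
print(sites)
for row in rows:
    print(row)
```

With N variants and M distinct sites, `rows` has N entries of length M, matching the N × M layout described above.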

Principal Component Analysis (PCA)

Hundreds of complete genome sequences are deposited to GISAID every day, which results in an ever-growing quantity of high-dimensional data representations for the K-means clustering. For example, if the data set of an organism involves 10 000 SNPs, the initial representation will be a 10 000-dimensional vector for each sample, which can be computationally difficult for a simple K-means clustering algorithm. Therefore, a dimensionality reduction method is used to preprocess the data. The essential idea of PCA-based K-means clustering is to invoke PCA to obtain a reduced-dimensional representation of each sample before performing the K-means clustering. In practice, one can select the first few principal components as the K-means input for each sample. In ref (5), the authors proved that the principal components are the continuous solution of the cluster indicators in the K-means clustering method, which provides a rigorous mathematical tool to embed our high-dimensional data into a low-dimensional PCA subspace.
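The reduction step can be sketched with a singular value decomposition of the centered feature matrix, which is one standard way to compute principal components. This is an illustration under stated assumptions and not necessarily the procedure used in this work: NumPy is assumed to be available, the function name `pca_reduce` is hypothetical, and the small binary matrix stands in for a location-based representation.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project samples onto the leading principal components.

    Returns the projected scores (one row per sample) and the fraction of
    the total variance explained by each retained component.
    """
    x = np.asarray(features, dtype=float)
    centered = x - x.mean(axis=0)  # PCA operates on mean-centered data
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return scores, explained[:n_components]


# Hypothetical 4 x 5 location-based representation (rows are samples).
features = [[1, 0, 0, 0, 0],
            [1, 0, 1, 1, 1],
            [0, 1, 0, 0, 0],
            [1, 0, 1, 1, 0]]

scores, ratio = pca_reduce(features, n_components=2)
print(scores.shape)
print(ratio)
```

The rows of `scores` would then replace the original high-dimensional vectors as the input of the K-means clustering.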

Conclusion

The rapid global transmission of coronavirus disease 2019 (COVID-19) has offered some of the most heterogeneous, diverse, and challenging mutagenic environments to stimulate dramatic genetic evolution and response from severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). This work provides the most comprehensive genotyping of SARS-CoV-2 transmission and evolution to date based on 15 140 genome samples and reveals six clusters of the COVID-19 genomes and associated mutations on eight different SARS-CoV-2 proteins. We introduce the mutation h-index and the mutation ratio to quantify each protein’s degree of nonconservativeness. We unveil that SARS-CoV-2 envelope protein, main protease, and endoribonuclease protein are relatively the most conservative, whereas SARS-CoV-2 nucleocapsid protein, spike protein, and papain-like protease are relatively the most nonconservative. We report that all of the SARS-CoV-2 proteins have undergone intensive mutations since January 5, 2020, and some of these mutations might seriously undermine ongoing efforts on COVID-19 diagnostic testing, vaccine development, antibody therapeutics, and small-molecular drug discovery.

Data Availability

The nucleotide sequences of the SARS-CoV-2 genomes used in this analysis are available, upon free registration, from the GISAID database (https://www.gisaid.org/). Eighteen tables are provided in the Supporting Information for SNP variants of 15 140 SARS-CoV-2 samples across the world, SNP variants of 4587 SARS-CoV-2 samples in the US, SNP variants in six global clusters, SNP variants in four US clusters, and mutation records for eight SARS-CoV-2 proteins. The acknowledgments of the SARS-CoV-2 genomes are also given in the Supporting Information.