
Mutations in human SARS-CoV-2 spike proteins, potential drug binding and epitope sites for COVID-19 therapeutics development.

Abstract

Comparison of 303,250 human SARS-CoV-2 spike protein sequences with the Wuhan-Hu-1 reference protein sequence showed that ∼96.5% of the positions in the spike protein sequence have undergone mutation since the outbreak of the COVID-19 pandemic, first reported in December 2019. A total of 1,269,629 mutations were detected, corresponding to 1,229 distinct mutation sites in the spike protein, which comprises 1,273 amino acid residues. Thus, ∼3.5% of the human SARS-CoV-2 spike protein sequence has remained invariant over the past two years. Because different mutations can occur at the same mutation site, a total of 4,729 distinct mutations were observed and are catalogued in the present work. The WHO/CDC, U.S.A., classification and definitions for the current variants being monitored (VBM) and variant of concern (VOC) are assigned to the SARS-CoV-2 spike protein mutations identified in the present work, along with a list of other amino acid substitutions observed for the variants. All 195 amino acid residues in the receptor binding domain (Thr333-Pro527) were associated with mutations, including Lys417, Tyr449, Tyr453, Ala475, Asn487, Thr500, Asn501 and Gly502, which interact with the ACE-2 receptor at ≤3.2 Å distance in the crystal structure complex available in the Protein Data Bank (PDB code: 6LZG). However, not all of these residues were mutated in the same spike protein. In particular, Gly502, mutated in only two spike protein sequences, and Tyr449, mutated in only seven, constitute potential sites for the design of suitable inhibitors/drugs. Further, forty-four invariant residues were observed that correspond to ten domains/regions in the SARS-CoV-2 spike protein, and some of these residues that are exposed on the protein surface may serve as epitope targets to develop monoclonal antibodies.

Entities: Chemical

Keywords: Drug design sites; Epitope sites; Human SARS-CoV-2 mutations; Invariant sites; Mutation propensity

Year: 2022 PMID: 35156058 PMCID: PMC8824715 DOI: 10.1016/j.crstbi.2022.01.002

Source DB: PubMed Journal: Curr Res Struct Biol ISSN: 2665-928X

Introduction

The outbreak of the ongoing COVID-19 pandemic, caused by human SARS-CoV-2 infection, was first reported from the city of Wuhan, Hubei province, China, during December 2019 (Wu et al., 2020). The disease has since spread rapidly across the world, causing serious infections in millions of people and leading to the loss of many human lives (https://www.worldometers.info/coronavirus/). SARS-CoV-2, which belongs to the Coronaviridae family, subfamily Orthocoronavirinae and β-CoV genus (https://www.ncbi.nlm.nih.gov/taxonomy/694009), has a 30 kb positive-stranded RNA genome comprising genes translated into structural and non-structural proteins. One of these proteins, the spike glycoprotein (S-protein), is a homotrimer that presents itself on the surface of the virion as a ‘crown’ and is involved in the recognition of the human host cell surface ACE-2 receptor, an essential requirement for fusion of the viral and host cellular membranes and transfer of the viral nucleocapsid into host cells (Zhang et al., 2020). SARS-CoV-2 is known to have its origins in bats (Zhou et al., 2020) and to have been transmitted to humans via pangolins as an intermediate host species (Han, 2020; Lam et al., 2020; Guruprasad, 2020a, Guruprasad, 2020c,d). The disease is currently known to spread mainly via human-to-human contact through respiratory droplets released into the air by infected persons while coughing or sneezing, or via contact with virus-contaminated surfaces. The spike protein comprises an N-terminal S1 subunit and a C-terminal membrane-proximal S2 subunit. The S1 subunit contains four domains: S1A, S1B, S1C and S1D. The S1A or N-terminal domain (NTD) recognises sialic acid carbohydrate required for attachment of the virus to the host cell surface, and the S1B or receptor-binding domain (RBD) interacts with the human ACE-2 receptor (Zhang et al., 2020; Wang et al., 2020a). 
The S2 subunit comprises three long α-helices, multiple α-helical segments, extended twisted β-sheets, a membrane-spanning α-helix and an intracellular cysteine-rich segment (Guruprasad, 2021). A furin-cleavage site, represented by a ‘PRRA’ sequence motif, is present between the S1 and S2 subunits, and another proteolytic cleavage site, S2′, lies in the S2 subunit upstream of the fusion peptide (Ou et al., 2020). These cleavage sites play a role in entry of the virus into host cells. Currently, there are no approved drugs to specifically treat COVID-19 patients. However, certain drugs known to treat other diseases have been approved under emergency use authorization (EUA) by the U.S. Food and Drug Administration (FDA) to treat COVID-19 under strict medical supervision. The antiviral drugs remdesivir (Veklury) and favipiravir (Avigan); the rheumatoid arthritis drug baricitinib (Olumiant); and the monoclonal antibody combinations of bamlanivimab and etesevimab by Eli Lilly, U.S.A., and casirivimab and imdevimab by Regeneron, U.S.A., are some of the drugs in use, and several different therapies are being researched (https://www.mayoclinic.org/, https://www.goodrx.com/). The vaccines approved by the W.H.O. are: Moderna COVID-19 (mRNA-1273) (U.S.A.), Oxford/AstraZeneca COVID-19 (U.K. and Sweden), Johnson & Johnson COVID-19 (U.S.A.), Pfizer BioNTech COVID-19 (U.S.A. and Germany). The other vaccines approved for use in one or more countries include: Oxford/AstraZeneca vaccine - COVISHIELD (manufactured by Serum Institute of India), COVAXIN developed by Bharat Biotech (India) in collaboration with ICMR, SPUTNIK V (Russia), Sinopharm COVID-19 (China), CUREVAC (Germany). A draft landscape and tracker of COVID-19 candidate vaccines currently under different stages of clinical trials and awaiting approvals is available at (https://www.who.int/publications/m/item/draft-landscape-of-covid-19-candidate-vaccines). Viruses are known to constantly evolve through mutations. 
The genetic differences between viruses associated with one or more mutations lead to genetic variants of the virus. Ever since the outbreak of the COVID-19 pandemic, human SARS-CoV-2 has been undergoing several mutations. Sequences with similar variants are grouped into lineages, and multiple lineages can have the same amino acid substitution. Also, different amino acid substitutions can be observed at the same mutation site. The Centers for Disease Control and Prevention (CDC), U.S.A., in collaboration with the SARS-CoV-2 Interagency Group (SIG), established under the US Department of Health and Human Services (HHS), designed a classification scheme that was recently revised to include a fourth class named Variant Being Monitored (VBM) along with the previously defined three classes of SARS-CoV-2 variants, i.e., Variants Of Interest (VOI), Variants Of Concern (VOC) and Variant Of High Consequence (VOHC). The SARS-CoV-2 variant classification includes definitions and attributes of the variants along with the resulting public health action. The attributes associated with the VOIs are: changes to receptor binding, reduced neutralization by antibodies generated against previous infection or vaccination, reduced efficacy of treatments, potential diagnostic impact or predicted increase in transmissibility or disease severity. The VOCs are defined where there is evidence of an increase in transmissibility, severe disease leading to hospitalization or deaths, significant reduction in neutralization by antibodies generated during previous infection or vaccination, reduced effectiveness of treatments or vaccines, or diagnostic failures. A variant of high consequence is defined as one for which there is clear evidence that prevention measures or medical countermeasures have significantly reduced effectiveness relative to previously circulating variants. 
The VBM class includes variants with substitutions of concern along with the previously designated VOIs and VOCs (https://www.cdc.gov/coronavirus/2019-ncov/variants/variant-info.html). The Phylogenetic Assignment of Named Global Outbreak Lineages (PANGO), a web-based application developed by the Centre for Genomic Pathogen Surveillance (CGPS) in South Cambridgeshire (https://www.sanger.ac.uk/collaboration/centre-global-pathogen-surveillance-cgps/), assists genomic epidemiology by implementing a dynamic nomenclature for SARS-CoV-2 lineages (PANGO nomenclature) (Rambaut et al., 2020), available at (https://cov-lineages.org/lineage_list.html). The WHO proposed the use of labels based on letters of the Greek alphabet, corresponding to the different PANGO lineages, which are useful to track the transmission and spread of SARS-CoV-2 variants including the VOC (https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/). At the time of communicating this manuscript, there are nine VBMs (alpha, beta, gamma, epsilon, eta, iota, kappa, zeta, mu), one VBM not yet classified, with PANGO lineage B.1.617.3, one VOC (delta), and none in the VOI or VOHC categories. A number of studies have been reported on the mutations in human SARS-CoV-2 spike proteins based on either infectivity and reactivity to a panel of neutralizing antibodies and sera from convalescent patients (Li et al., 2020), or computational analyses of the protein sequences (Guruprasad, 2020a, 2020b, 2021; Kaushal et al., 2020; Korber et al., 2020a; Emary et al., 2021; Phelan et al., 2020; Mercatelli and Giorgi, 2020; Yadav et al., 2020; Saha et al., 2020). The D614G mutation that established itself as the dominant form was one of the first documented in the US during the initial stages of the pandemic (Korber et al., 2020a), after circulating in Europe (Emary et al., 2021). 
It has been shown based on the neutralization studies with the spike protein receptor-binding domain monoclonal antibodies and convalescent sera from people with and without the mutation, that viruses with the D614G mutation spread more quickly than viruses without this mutation (Korber et al., 2020b) and that the mutation was not a hindrance to vaccine development (Wu et al., 2021; Shen et al., 2021). The variants characterisation through constant genomic surveillance, especially, in regions where there are viral surges, is useful to inform local outbreak investigations and understand national trends. Such studies aid in defining policies to contain spread of the virus, as well as, monitor the potential impact of the variants on the diagnostics, vaccines and therapeutics, so that variants with reduced susceptibility to treatments are detected quickly. In the present study, a catalogue of all the known mutations identified among 303,250 human SARS-CoV-2 spike protein sequences with respect to the reference spike protein sequence Wuhan-Hu-1 is presented. The mutations were classified according to the WHO/CDC variant classification and definitions. The other amino acid substitutions observed at the variant site are also listed. The amino acid residues in the viral human SARS-CoV-2 spike protein RBD that makes interactions with human host ACE-2 receptor were identified from the crystal structure complex available in the Protein Data Bank and the spike protein mutations analysed to suggest potential drug design sites. Further, the invariant amino acid residues associated with the different domains/regions in human SARS-CoV-2 spike proteins were analysed to suggest potential epitope sites. 
The analyses provide information on: the distinct mutation sites observed in the human SARS-CoV-2 spike protein with respect to the reference sequence; the distribution of the total number of mutations observed at the distinct mutation sites and according to the different domains/regions in the spike protein sequence; mutation percentages; propensities for mutated residues and non-mutated sites corresponding to domains/regions in the spike protein; classification of the variants according to WHO labels, along with other amino acid substitutions observed in the present work; and identification of potential inhibitor/epitope sites for drug/antibody design.

Method

The human SARS-CoV-2 spike protein (surface glycoprotein) sequences available in the NCBI Virus database (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/) as of November 2, 2021 were obtained in FASTA format. Only spike protein sequences comprising 1,273 amino acid residues, as in the reference protein sequence (NCBI Accession code: YP_009724390.1), were considered for the analyses. The mutations were identified by comparing individual spike protein sequences to the reference spike protein sequence in the human SARS-CoV-2 genome isolate collected from an infected individual during December 2019 in Wuhan, China (Wu et al., 2020). The software suite of programs developed at ABREAST™ (https://www.abreast.in) was used to identify and evaluate the total number of distinct mutations in the individual spike proteins and the distribution of the total number of mutations, and to calculate mutation propensities corresponding to the different domains/regions in the protein. The invariant residues were highlighted along the reference spike protein sequence, together with the amino acid residues in the RBD that were ≤3.2 Å distance from the ACE-2 receptor according to the three-dimensional crystal structure available in the Protein Data Bank (PDB) (Berman et al., 2000) (PDB code: 6LZG) (Wang et al., 2020b). The sPDBview (Guex and Peitsch, 1997) and PyMOL (DeLano, 2002) software were used for graphics visualization and to generate figures. The WHO classification of SARS-CoV-2 variants was used according to the US government SARS-CoV-2 Interagency Group (SIG) Variant Classifications and Definitions by the Centers for Disease Control and Prevention (CDC, U.S.A), which defines four classes: Variant Being Monitored (VBM), Variant of Interest (VOI), Variant of Concern (VOC), Variant of High Consequence (VOHC) (https://www.cdc.gov/coronavirus/2019-ncov/variants/variant-info.html). 
The different lineages in variant classification were according to the Phylogenetic Assignment of Named Global Outbreak Lineages (PANGO) nomenclature and were derived by consulting the website: (https://cov-lineages.org/lineage_list.html). Further, the characteristic mutations in the spike protein (S gene) were obtained by consulting the report for the specific Lineage (MullenGinger et al., 2020). The other amino acid substitutions observed in the present work corresponding to the variants are also reported.
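
The comparison step described above can be illustrated with a minimal sketch. It is not the ABREAST software, whose internals are not described here; it only assumes that, because all sequences have the same length as the reference, mutations can be called position by position without alignment. The function names and the toy sequences are hypothetical, and ambiguous residue codes are not treated specially.

```python
from collections import defaultdict


def call_mutations(reference: str, sequence: str) -> list[str]:
    """List substitutions (e.g. 'L5F') between two equal-length sequences."""
    if len(reference) != len(sequence):
        raise ValueError("sequence length differs from the reference")
    return [
        f"{ref}{pos}{alt}"
        for pos, (ref, alt) in enumerate(zip(reference, sequence), start=1)
        if ref != alt
    ]


def summarize(reference: str, sequences: list[str]) -> tuple[int, int, int]:
    """Return (total mutations, distinct mutation sites, distinct mutations)."""
    total = 0
    substitutions_at = defaultdict(set)  # position -> substitutions seen there
    for sequence in sequences:
        for mutation in call_mutations(reference, sequence):
            total += 1
            substitutions_at[int(mutation[1:-1])].add(mutation)
    distinct = sum(len(subs) for subs in substitutions_at.values())
    return total, len(substitutions_at), distinct


# Toy data (hypothetical): a 10-residue reference and four sequences.
toy_reference = "MFVFLVLLPL"
toy_sequences = ["MFVFFVLLPL", "MFVFIVLLPL", "MFVFFVLLPL", "MFVFLVLLPL"]
print(summarize(toy_reference, toy_sequences))  # (3, 1, 2)
```

In the toy data, three sequences differ from the reference at position 5, giving three mutations at one site with two distinct substitutions (L5F and L5I). These are the same three kinds of count reported for the full dataset.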

Results and discussions

A total of 303,250 human SARS-CoV-2 spike protein sequences comprising 1,273 amino acid residues available in the NCBI Virus database (as on November 2, 2021) were analysed. These spike protein sequences contained a total of 1,269,629 mutations that represented 1,229 distinct mutation sites, suggesting ∼96.54% of the human SARS-CoV-2 spike protein sequence has undergone the mutations since outbreak of the COVID-19 pandemic during December 2019. The mutation sites represented a total of 4,729 distinct mutations. The total number of distinct mutation sites observed along with their distribution according to the different domains/regions in the human SARS-CoV-2 spike protein is listed in Table 1. A catalog of all the mutations identified is attached in the Supplementary Table 1.

Table 1

Distribution of total numbers of distinct mutation sites and mutations in domains/regions of the human SARS-CoV-2 spike proteins relative to the reference protein sequence.

Domains/Regions	Number of amino acid residues in domains/regions	Total number of distinct mutation sites in domains/regions	Total number of observed mutations
S1^A (NTD) (1-302)	302	302	422247
S1^A-S1^B linker (303-332)	30	30	2194
S1^B (RBD) (333-527)	195	195	246622
S1^B – S1^C linker (528-533)	6	6	190
S1^C domain (534-589)	56	56	5732
S1^C – S1^D linker (590-593)	4	4	85
S1^D domain (594-674)	81	81	322034
Protease cleavage site (675-692)	18	18	87140
S1–S2 subunits linker (693-710)	18	18	21631
Central β-strand (711-737)	27	26	19294
Downward helix (738-782)	45	41	5676
S2′ cleavage site (783-815)	33	32	3125
Fusion peptide (816-828)	13	13	357
Connecting region (829-911)	83	79	6745
Heptad repeat region (912-983)	72	62	53655
Central helix (984-1034)	51	40	21907
β-hairpin (1035–1068)	34	30	1225
β-sheet domain (1069-1133)	65	63	11775
Heptad repeat region (1134-1213)	80	75	26561
Transmembrane region (1214-1236)	23	23	3971
Cytoplasmic region (1237-1273)	37	35	7463
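
The totals quoted in the text can be cross-checked against the rows of Table 1. The short sketch below copies the table values and sums them; the variable names are illustrative only.

```python
# Rows copied from Table 1:
# (domain/region, residues, distinct mutation sites, observed mutations)
TABLE_1 = [
    ("S1A (NTD)", 302, 302, 422247),
    ("S1A-S1B linker", 30, 30, 2194),
    ("S1B (RBD)", 195, 195, 246622),
    ("S1B-S1C linker", 6, 6, 190),
    ("S1C domain", 56, 56, 5732),
    ("S1C-S1D linker", 4, 4, 85),
    ("S1D domain", 81, 81, 322034),
    ("Protease cleavage site", 18, 18, 87140),
    ("S1-S2 subunits linker", 18, 18, 21631),
    ("Central beta-strand", 27, 26, 19294),
    ("Downward helix", 45, 41, 5676),
    ("S2' cleavage site", 33, 32, 3125),
    ("Fusion peptide", 13, 13, 357),
    ("Connecting region", 83, 79, 6745),
    ("Heptad repeat region (912-983)", 72, 62, 53655),
    ("Central helix", 51, 40, 21907),
    ("Beta-hairpin", 34, 30, 1225),
    ("Beta-sheet domain", 65, 63, 11775),
    ("Heptad repeat region (1134-1213)", 80, 75, 26561),
    ("Transmembrane region", 23, 23, 3971),
    ("Cytoplasmic region", 37, 35, 7463),
]

residues = sum(row[1] for row in TABLE_1)
sites = sum(row[2] for row in TABLE_1)
mutations = sum(row[3] for row in TABLE_1)
invariant = residues - sites
# Regions in which at least one residue position was never mutated.
partly_invariant = [row[0] for row in TABLE_1 if row[2] < row[1]]

print(residues, sites, mutations)
print(f"mutated positions: {100 * sites / residues:.2f}%")
print(f"invariant positions: {invariant} in {len(partly_invariant)} regions")
```

The sums reproduce the 1,273 residues, 1,229 distinct mutation sites and 1,269,629 observed mutations, and the difference gives the 44 invariant positions spread over ten domains/regions.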

The distribution of the total number of mutations observed at the different mutation sites along the spike protein sequence is shown in Fig. 1A. The D614G mutation, reported previously as the most frequent (Guruprasad, 2020a, 2021; Li et al., 2020), is predominant in the present dataset too and accounts for ∼23.27% of the total number of observed mutations, whereas the mutation percentages at all the remaining sites in the spike protein were <5.5%. The top 26 mutation sites in the dataset, arranged in decreasing order of mutation percentage, are shown in Fig. 1B. These represent the mutations at: D614, L452, P681, T478, E484, D950, T19, T95, L5, W152, S13, D253, N501, L18, S477, P26, D138, T1027, A701, V1176, T20, H655, K417, R190, T732 and Q677.

Fig. 1

A. Distribution of the total number of mutations at the mutation sites in human SARS-CoV-2 spike protein sequence. B. Top 26 mutation sites in human SARS-CoV-2 proteins arranged in decreasing order of the total number of observed mutations.

The classification of human SARS-CoV-2 mutations identified in the present work according to different PANGO lineages (https://cov-lineages.org/lineage_list.html), using the WHO Greek-letter labels, is listed in Table 2. According to the variant classification and definitions by the CDC, U.S.A. (https://www.cdc.gov/coronavirus/2019-ncov/variants/variant-info.html), these are grouped into four categories: Variants Being Monitored (VBM), Variants of Interest (VOI), Variants of Concern (VOC), Variants of High Consequence (VOHC). Currently there are ten VBMs and one VOC, with none in the VOI or VOHC categories. The table contains 46 variants (indicated in bold) along with other amino acid substitutions observed that may constitute a sub-lineage or a new lineage. The D614G mutation is associated with all the current VBMs and the VOC. The other amino acid substitutions observed at the most commonly mutated position 614 in human SARS-CoV-2 spike protein sequences were: D614N, D614A, D614S.

Table 2

The WHO classification of variants along with the other amino acid substitutions observed [mentioned within square brackets] represented among the 303,250 human SARS-CoV-2 spike protein sequences.

L5F	(Iota) [L5I, L5V, L5J, L5G]
S13I	(Epsilon) [S13Q, S13T, S13G, S13N, S13R, S13C]
L18F	(Gamma) [L18R, L18I, L18K, L18V, L18N, L18T, L18P]
T19R	(Delta) [T19I, T19L, T19K, T19S, T19A]
T20N	(Gamma) [T20I, T20R, T20S, T20A, T20P, T20F]
P26S	(Gamma) [P26L, P26H, P26A, P26F, P26R, P26Y, P26T]
A67V	(Eta) [A67S, A67T, A67I, A67G, A67P, A67H, A67D]
D80G	(Iota) [D80Y, D80A, D80F, D80H, D80N, D80C, D80W, D80P, D80B, D80R, D80E]
D80A	(Beta) [D80Y, D80G, D80F, D80H, D80N, D80C, D80W, D80P, D80B, D80R, D80E]
T95I	(Iota, Kappa, Delta) [T95E, T95A, T95N, T95S, T95K, T95P]
D138Y	(Gamma) [D138H, D138C, D138B, D138G, D138F, D138P, D138N, D138A, D138V]
G142D	(Kappa, Delta) [G142Y, G142F, G142S, G142V, G142A, G142C, G142L]
W152C	(Epsilon) [W152L, W152S, W152R, W152K, W152F]
E154K	(Kappa) [E154A, E154W, E154F, E154Q, E154D, E154G, E154V, E154S]
F157S	(Iota) [F157C, F157Y, F157L, F157V, F157I]
R158G	(Delta) [R158S, R158Y, R158K, R158L, R158I]
R190S	(Gamma) [R190V, R190M, R190F, R190K, R190W, R190N, R190L, R190G, R190I]
D215G	(Beta) [D215Y, D215A, D215H, D215P, D215N, D215R, D215E, D215V]
D253G	(Iota) [D253Y, D253N, D253S, D253A, D253V, D253H]
K417T	(Gamma) [K417N, K417R, K417A, K417E, K417M]
K417N	(Beta, Delta) [K417T, K417R, K417A, K417E, K417M]
L452R	(Epsilon, Iota, Kappa, Delta) [L452M, L452Q, L452W, L452P]
S477N	(Iota) [S477I, S477G, S477R, S477K, S477P, S477T, S477B]
T478K	(Delta) [T478R, T478C, T478I, T478A]
E484K	(Alpha, Beta, Gamma, Eta, Iota, Zeta, Mu) [E484Q, E484Z, E484G, E484A, E484D, E484F, E484R, E484V, E484S]
E484Q	(Kappa) [E484K, E484Z, E484G, E484A, E484D, E484F, E484R, E484V, E484S]
S494P	(Alpha) [S494G, S494L, S494R, S494T, S494A, S494Q]
N501Y	(Alpha, Beta, Gamma, Mu) [N501T, N501S, N501V, N501I, N501H, N501R, N501K]
A570D	(Alpha) [A570V, A570S, A570T, A570G]
D614G	(Alpha, Beta, Gamma, Epsilon, Eta, Iota, Kappa, Zeta, Mu, Delta) [D614N, D614S, D614A]
H655Y	(Gamma) [H655N, H655P, H655R, H655L]
Q677H	(Eta) [Q677P, Q677R, Q677E, Q677Y, Q677S, Q677L, Q677K]
P681H	(Alpha, Mu) [P681R, P681L, P681Y, P681S]
P681R	(Kappa, Delta) [P681H, P681L, P681Y, P681S]
A701V	(Beta, Iota) [A701T, A701S, A701E, A701I]
T716I	(Alpha) [T716P, T716S]
T859N	(Iota) [T859I, T859S]
F888L	(Eta) [F888S, F888V]
D950H	(Iota) [D950N, D950B, D950Y, D950A, D950E, D950S]
D950N	(Delta) [D950B, D950H, D950Y, D950A, D950E, D950S]
S982A	(Alpha) [S982L]
T1027I	(Gamma) [T1027A, T1027S, T1027N]
Q1071H	(Kappa) [Q1071L, Q1071R, Q1071Y]
D1118H	(Alpha) [D1118G, D1118A, D1118Y]
V1176F	(Zeta) [V1176I]
K1191N	(Alpha) [K1191R, K1191T, K1191E, K1191M, K1191Q]
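
Entries such as those in Table 2 follow the usual substitution notation: reference residue, position, substituted residue. As a minimal sketch, the labels can be parsed and grouped by site as below. The helper names are hypothetical, and the example labels are taken from the D614G and P681H/P681R rows of the table.

```python
import re

# One-letter reference residue, position, one-letter substituted residue.
SUBSTITUTION = re.compile(r"([A-Z])(\d+)([A-Z])")


def parse_substitution(label: str) -> tuple[str, int, str]:
    """Split a label such as 'D614G' into ('D', 614, 'G')."""
    match = SUBSTITUTION.fullmatch(label)
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def group_by_site(labels):
    """Map position -> (reference residue, set of substituted residues)."""
    sites = {}
    for label in labels:
        ref, pos, alt = parse_substitution(label)
        known_ref, alts = sites.setdefault(pos, (ref, set()))
        if known_ref != ref:
            raise ValueError(f"conflicting reference residue at {pos}")
        alts.add(alt)
    return sites


labels = ["D614G", "D614N", "D614S", "D614A",
          "P681H", "P681R", "P681L", "P681Y", "P681S"]
for pos, (ref, alts) in sorted(group_by_site(labels).items()):
    print(pos, ref, "".join(sorted(alts)))
```

Grouping by position makes it easy to see that a single site can carry several different substitutions, which is why the number of distinct mutations exceeds the number of distinct mutation sites.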

Mutations were associated with 1,229 amino acid residue positions in the human SARS-CoV-2 spike protein. According to Table 1, every residue position was mutated in the following domains/regions: S1A (NTD) domain (1-302), S1A-S1B linker (303-332), S1B (RBD) domain (333-527), S1B – S1C linker (528-533), S1C domain (534-589), S1C – S1D linker (590-593), S1D domain (594-674), protease cleavage site (675-692), S1–S2 subunits linker (693-710), fusion peptide (816-828) and transmembrane region (1214-1236). The mutation percentages corresponding to the different domains/regions are shown in Fig. 2. Accordingly, relatively high mutation percentages were associated with the NTD domain (33.25%), S1D domain (25.36%) and RBD domain (19.42%) compared with the other domains, suggesting these domains may be more susceptible to mutations. The main contributors to the high mutation percentages in these domains (Fig. 1B) were mutations at: T19, T95, L5, W152, S13, D253, L18, P26, D138, T20, R190 (NTD domain), H655 (S1D domain) and L452, T478, E484, N501, S477, K417 (RBD domain). The S1A (NTD) domain is known to contain two regions, M153ESEFR158 and S247YLTPG252, specific to human SARS-CoV-2 compared to SARS-CoV, both of which via their glycosylated spike proteins recognize the human angiotensin converting enzyme-2 (ACE-2) receptor (Guruprasad, 2020b). Although all residues at the positions M153 to R158 and S247 to G252 are known to be mutated (Supplementary Table 1), it is interesting to note that the residue W152 immediately preceding the first region and the residue D253 following the second region were associated with relatively high mutation percentages and are among the top 26 mutation sites in the spike protein (Fig. 1B). 
However, residues in the above two regions mentioned were associated with far fewer mutations, suggesting perhaps the importance of these regions in the recognition of ACE-2 receptor by the human SARS-CoV-2 spike protein.

Fig. 2

Mutation percentages in different domains/regions of the human SARS-CoV-2 spike proteins.

The mutation propensity distribution according to the domains/regions, shown in Fig. 3, indicates that the protease cleavage site, comprising only 18 amino acid residues, is associated with the maximum value (4.85), followed by the S1D domain, comprising 81 amino acid residues, with a mutation propensity value of 3.98. The mutations at P681 (with 66,197 mutations) and at D614 (with 295,479 mutations), out of a total of 1,269,629 observed, mainly contribute to the high propensity values for the protease cleavage site and S1D domain, respectively. The variants associated with the protease cleavage site (residues 675-692) according to WHO classification were: Q677H (Eta), P681H (Alpha, Mu), P681R (Kappa, Delta), and the variants associated with the S1D domain (residues 594-674) were: D614G (Alpha, Beta, Gamma, Epsilon, Eta, Iota, Kappa, Zeta, Mu, Delta) and H655Y (Gamma). The mutations observed at the furin-cleavage site, which plays a role in virus-host cell entry and is represented by the ‘PRRA’ sequence motif, were: P681R, P681H, P681L, P681Y, P681S, R682Q, R682L, R682W, R682G, R682P, R683W, R683L, R683Q, R683P, A684V, A684S, A684T, A684E, A684P.
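
The propensity formula is not stated explicitly in this section. The sketch below assumes the common definition: a region's share of all observed mutations divided by its share of all residues. Under that assumption, the Table 1 counts give values close to those reported for the protease cleavage site and the S1D domain. The function name is hypothetical.

```python
TOTAL_RESIDUES = 1273        # length of the spike protein
TOTAL_MUTATIONS = 1_269_629  # all observed mutations (Table 1 total)


def mutation_propensity(region_mutations: int, region_residues: int) -> float:
    """Assumed definition: share of mutations / share of residues."""
    mutation_share = region_mutations / TOTAL_MUTATIONS
    length_share = region_residues / TOTAL_RESIDUES
    return mutation_share / length_share


# (observed mutations, residues) copied from Table 1
regions = {
    "Protease cleavage site (675-692)": (87140, 18),
    "S1D domain (594-674)": (322034, 81),
}
for name, (count, length) in regions.items():
    print(f"{name}: {mutation_propensity(count, length):.3f}")
```

A value above 1 means the region carries more mutations than its length alone would predict. The small difference from the reported 3.98 for the S1D domain is consistent with the reported figures being truncated rather than rounded.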

Fig. 3

Amino acid mutation propensity corresponding to different domains/regions in human SARS-CoV-2 spike proteins.

Currently, two anti-SARS-CoV-2 monoclonal antibody treatments with FDA Emergency Use Authorization (EUA) are available to healthcare providers in the U.S.A. for the treatment of COVID-19: bamlanivimab plus etesevimab (https://www.fda.gov/media/145802/download) and casirivimab plus imdevimab (https://www.fda.gov/media/143892/download). In laboratory studies, SARS-CoV-2 variants containing the L452R or E484K substitution in the spike protein were reported to cause a marked reduction in susceptibility to bamlanivimab and possibly lower sensitivity to etesevimab and casirivimab, and these have therefore been defined as SARS-CoV-2 substitutions of therapeutic concern according to the CDC, U.S.A. (https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/variant-surveillance/variant-info.html). These mutations were observed in the present analyses (Supplementary Table 1) and the additional substitutions observed were: L452M, L452Q, L452W, L452P and E484Q, E484Z, E484G, E484A, E484D, E484F, E484R, E484V and E484S. According to the WHO classification, the L452R variant is associated with Epsilon, Iota, Kappa, Delta (Table 2), E484K is associated with Alpha, Beta, Gamma, Eta, Iota, Zeta, Mu, and the E484Q variant is associated with Kappa. In order to verify the correlation between in-silico prediction and experimental evidence, the literature was examined to infer mutations that affect neutralizing antibodies. A comprehensive map of the spike protein RBD escape mutants against the monoclonal antibodies LY-CoV016 (etesevimab) (Starr et al., 2021a) and LY-CoV555 (bamlanivimab), along with a combination cocktail of the two antibodies, has been reported (Starr et al., 2021b). 
Accordingly, binding of these antibodies is known to be affected by mutations at the following twenty-three sites: D405, E406, K417, D420, Y421, L452, L455, F456, N460, I472, Y473, A475, G476, V483, E484, G485, F486, N487, Y489, F490, Q493, S494, G504. Mutations at all these sites were observed in the present work and the different mutations are listed in Supplementary Table 1. Among the mutations that escape either monoclonal antibody or their combination, E484K for the LY-CoV555 antibody and K417N for the LY-CoV016 antibody show a >1000-fold change in the IC50 values according to the neutralization assays (Wang et al., 2021). The mutations at E484 and K417 are among the six spike protein RBD mutations within ≤3.2 Å interacting distance of the ACE-2 receptor as observed in the crystal structure complex (PDB code: 6LZG). The L452R mutation, which is the predominant spike protein RBD mutation at position 452, is also within interacting distance of the ACE-2 receptor and is known to escape the LY-CoV555 antibody (Starr et al., 2021b). The SARS-CoV-2 spike protein mutations reported here, taken together with experimental studies on the escape mutants for the antibodies and their cocktail combinations, suggest that more specific drugs/antibodies need to be developed for COVID-19. The amino acid residue sites in the SARS-CoV-2 spike protein that have not undergone any mutation during the two years since the outbreak of the COVID-19 pandemic, or those in the RBD that interact with the ACE-2 receptor and have undergone relatively few mutations, may serve as useful targets for the design. The SARS-CoV-2 spike protein residues that have remained invariant since the outbreak of the COVID-19 pandemic ∼2 years ago are listed in Table 3 along with their associated domains/regions. These residue sites, highlighted (in green) in Fig. 4, constitute ∼3.5% (44 of 1,273 residues) of the human SARS-CoV-2 spike glycoprotein sequence. 
The propensity values evaluated for the non-mutated sites in the spike protein according to the domain/region is shown in Fig. 5. The invariant residues were associated with ten domains/regions; the central helix with maximum propensity value (6.24) followed by heptad repeat region (residues 912-983) (4.01), β-hairpin region (3.4), downward helix (2.57), heptad repeat region (residues 1134-1213) (1.8), cytoplasmic region (1.56), connecting region (1.39), central β-strand (1.07), β-sheet domain (0.89) and the S2′ cleavage site (0.87). The three-dimensional electron microscopy structure of the SARS-CoV-2 spike glycoprotein (closed state) comprising the three subunits (PDB code: 6VXX_A-chain, B-chain, C-chain) (Walls et al., 2020) was used to map the position of invariant residues. Based on the molecular visualization in graphics using PyMol, some of the invariant residues representing different domains/regions that were exposed on the protein surface are shown in Fig. 6(A-D). These correspond to some of the residues in the heptad repeat region (residues 912-983) listed in Table 3. The residues; Gln920, Gln935, Asn953, Asn955 are shown highlighted (red) in the three-dimensional structure of the protein (PDB code: 6VXX_A-chain) in Fig. 6A and the equivalent residues were highlighted in the B-chain (green) and in C-chain (blue). The hydrophobic residues; Leu959 in the heptad repeat region and Ile1013 in central helix that are in close proximity in three-dimensional structure of the spike protein and the glycosylated residue; Asn801 in S2’ cleavage site is also exposed on the protein surface. The heptad repeat region (residues 912-983) and the central helix (residues 984-1034) are associated with relatively high propensity values for the invariant residue sites in the human SARS-CoV-2 spike protein as shown in Fig. 5. 
Also, the invariant hydrophobic residue; Leu1145 (in heptad repeat region 1134-1213) from all the three protein subunits form a hydrophobic cluster at the base of the spike protein as shown in Fig. 6B. It is noted that residues beyond this region are not defined in the three-dimensional structure for the individual chains. Further, the invariant residue; Cys1126 in the β-sheet domain (1069-1133) close to the N-acetyl-D-glucosamine (NAG) sites (NAG1319, NAG1320) in the A-chain are also exposed on the exterior of the protein surface as shown in Fig. 6C. The invariant residues; Glu988 and Gln992 in the central helix were exposed within the interior on the protein surface as shown in Fig. 6D. These invariant sites exposed on the protein surface may serve as potential epitope targets for antibody/inhibitor design. The epitopes may be constituted by the invariant residues contributed from more than one chain among the three spike protein subunits; A-chain (red), B-chain (green) and C-chain (blue) as shown in Fig. 6B.

Table 3

Domains/Regions	Non-mutated amino acid residues
Central β-strand (711-737)	F718
Downward helix (738-782)	C743, S746, C749, F782
S2′ cleavage site (783-815)	N801
Connecting region (829-911)	C840 (x), L878, G880, Q901
Heptad repeat region (912-983)	Q920, Q935, N953, N955, L959, L962, L966, F970, S974, L977
Central helix (984-1034)	E988, Q992, L996, R1000, L1004, Y1007, Q1010, I1013, K1028, M1029, C1032
β-hairpin (1035–1068)	Q1036, S1051, Q1054, H1064
β-sheet domain (1069-1133)	T1105, C1126
Heptad repeat region (1134-1213)	L1145, F1148 (x), L1152 (x), L1193 (x), Y1209 (x)
Cytoplasmic region (1237-1273)	F1256 (x), K1269 (x)
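
The propensity values quoted for the invariant sites can be approximated from the region boundaries and residue counts in Table 3. The sketch below is illustrative only: the formula is an assumption (the definition is not restated in this section), namely the fraction of all invariant residues that fall in a region divided by the fraction of the 1,273-residue sequence that the region occupies. Under that assumption the computed values agree with the reported ones to within about 0.01, the small differences being consistent with truncation rather than rounding. The names `REGIONS` and `propensities` are hypothetical.

```python
# Assumed definition (not stated in this section):
#   propensity = (invariant residues in region / all invariant residues)
#                / (region length / spike protein length)

SPIKE_LENGTH = 1273  # residues in the Wuhan-Hu-1 spike protein

# (first residue, last residue, number of invariant residues), from Table 3
REGIONS = {
    "Central beta-strand": (711, 737, 1),
    "Downward helix": (738, 782, 4),
    "S2' cleavage site": (783, 815, 1),
    "Connecting region": (829, 911, 4),
    "Heptad repeat region (912-983)": (912, 983, 10),
    "Central helix": (984, 1034, 11),
    "Beta-hairpin": (1035, 1068, 4),
    "Beta-sheet domain": (1069, 1133, 2),
    "Heptad repeat region (1134-1213)": (1134, 1213, 5),
    "Cytoplasmic region": (1237, 1273, 2),
}


def propensities(regions, protein_length):
    """Return the assumed invariant-site propensity for each region."""
    total_invariant = sum(n for _, _, n in regions.values())
    result = {}
    for name, (start, end, n_invariant) in regions.items():
        region_length = end - start + 1
        share_of_invariant = n_invariant / total_invariant
        share_of_sequence = region_length / protein_length
        result[name] = share_of_invariant / share_of_sequence
    return result


if __name__ == "__main__":
    values = propensities(REGIONS, SPIKE_LENGTH)
    # Print regions from highest to lowest propensity
    for name, value in sorted(values.items(), key=lambda kv: -kv[1]):
        print(f"{name:34s} {value:5.2f}")
```

A value above 1 means the region holds more invariant residues than its length alone would predict, which is why the central helix (11 invariant residues in 51 positions) ranks first.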

Fig. 4

Invariant amino acid residue positions and ACE-2 interacting sites in human SARS-CoV-2 spike protein.

Fig. 5

Non-mutated site propensities corresponding to the different domains/regions in human SARS-CoV-2 spike protein.

Fig. 6

A. Surface representation of the human SARS-CoV-2 spike protein three-dimensional structure (PDB code:6VXX) with A-chain (red), B-chain (green), C-chain (blue) showing invariant residues exposed on the protein surface in heptad repeat region (912-983), S2′ cleavage site (783-815) and central helix (984-1034). B. View showing proximity of invariant hydrophobic residues from different protein chains in heptad repeat region (1134-1213) exposed on the protein surface. C. View showing β-sheet domain (1069-1133) invariant residue close to glycosylation site exposed on the protein surface. D. View showing central helix (984-1034) invariant residues exposed on the protein surface. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

Table 3. Domain association of the non-mutated amino acid residues among 303,250 human SARS-CoV-2 spike proteins with reference to the Wuhan-Hu-1 spike protein sequence. (x) indicates residues missing in the three-dimensional structure.

The human SARS-CoV-2 spike protein receptor binding domain (RBD), comprising amino acid residues Thr333-Pro527, is involved in the interactions with the human host cell ACE-2 receptor. All 195 amino acid residues in the RBD have undergone mutations, and the top six of these mutation sites, along with their mutation percentages, were L452 (5.5%), T478 (3.77%), E484 (3.32%), N501 (1.95%), S477 (1.8%) and K417 (1.54%). The rest of the mutation sites in the RBD were each associated with ≤0.37%. The L452 mutation site also ranked second among the top 26 mutation sites observed in the overall SARS-CoV-2 spike protein (Fig. 1B), suggesting high susceptibility to mutations at these sites relative to the other residues in the spike protein RBD. In the crystal structure of the human SARS-CoV-2 RBD protein complexed with the ACE-2 receptor (PDB code: 6LZG) (Wang et al., 2020c), the atomic interactions between the spike protein RBD and the ACE-2 receptor, defined by interactions ≤3.2 Å, were Lys417(NZ)-Asp30(OD1), Tyr449(OH)-Asp38(OD1), Tyr449(OH)-Gln42(NE2), Tyr453(OH)-His34(ND1), Ala475(O)-Ser19(OG), Asn487(OD1)-Gln24(NE2), Asn487(ND2)-Tyr83(OH), Thr500(OG1)-Tyr41(OH), Asn501(ND2)-Tyr41(OH) and Gly502(N)-Lys353(O). The interacting amino acid residues in the spike protein are highlighted (in cyan) in Fig. 4. Except for Lys417 and Asn501, none of the spike protein RBD residues involved in the interactions with ACE-2 were among the top six mutation sites mentioned above. Therefore, most of these residue interactions may play an important role in host cell infection and virus transmission, and their mutations are likely to affect the human SARS-CoV-2 spike protein and ACE-2 interactions. Only two Gly502 mutations were observed out of the total 1,269,629 mutations among 303,250 human SARS-CoV-2 spike proteins. These were G502E in the spike protein with NCBI Accession code UAT83124.1 and G502V in QTY96446.1. Likewise, only five distinct Tyr449 mutations were observed, in seven proteins: Y449S (QWN56156.1, QZJ78555.1), Y449D (QNH88954.1), Y449N (QJD23270.1, UDB67143.1), Y449H (UCQ96089.1) and Y449F (UCK95525.1). An examination of the other mutations in the spike proteins (NCBI Accession codes) mentioned above that contain the Gly502 and Tyr449 mutations revealed that none of the individual spike proteins contained mutations involving all the residues that interact with the ACE-2 receptor, as shown in Table 4. This observation suggests that not all of the SARS-CoV-2 spike protein RBD-ACE-2 receptor interactions outlined earlier in this work may be necessary for virus-host cell transmission.
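
The ≤3.2 Å criterion used above amounts to a distance cutoff between atoms of two chains. The following is a minimal sketch of such a search over fixed-column PDB `ATOM` records, not the procedure actually used in this work. The chain identifiers `S` and `R`, the helper `toy_record` and the three coordinates are hypothetical and are not taken from 6LZG; for a real entry, the file contents and the correct chain identifiers would need to be supplied.

```python
import math

CUTOFF = 3.2  # angstroms, the threshold used in the text


def parse_atoms(pdb_text):
    """Extract chain, residue and coordinates from fixed-column ATOM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atoms.append({
            "atom": line[12:16].strip(),   # atom name, columns 13-16
            "res": line[17:20].strip(),    # residue name, columns 18-20
            "chain": line[21],             # chain identifier, column 22
            "num": int(line[22:26]),       # residue number, columns 23-26
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        })
    return atoms


def contacts(atoms, chain_a, chain_b, cutoff=CUTOFF):
    """Return inter-chain atom pairs with distance <= cutoff, closest first."""
    first = [a for a in atoms if a["chain"] == chain_a]
    second = [a for a in atoms if a["chain"] == chain_b]
    pairs = []
    for a in first:
        for b in second:
            distance = math.dist(a["xyz"], b["xyz"])
            if distance <= cutoff:
                label_a = f'{a["res"].capitalize()}{a["num"]}({a["atom"]})'
                label_b = f'{b["res"].capitalize()}{b["num"]}({b["atom"]})'
                pairs.append((label_a, label_b, round(distance, 2)))
    return sorted(pairs, key=lambda pair: pair[2])


def toy_record(serial, atom, res, chain, num, x, y, z):
    """Format one ATOM record with the standard PDB column layout."""
    return (f"ATOM  {serial:5d} {atom:<4s} {res:>3s} {chain}{num:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


# Hypothetical coordinates for illustration only (not from 6LZG)
TOY_PDB = "\n".join([
    toy_record(1, "NZ", "LYS", "S", 417, 0.0, 0.0, 0.0),
    toy_record(2, "OD1", "ASP", "R", 30, 2.9, 0.0, 0.0),
    toy_record(3, "OH", "TYR", "R", 83, 9.0, 0.0, 0.0),
])

if __name__ == "__main__":
    for pair in contacts(parse_atoms(TOY_PDB), "S", "R"):
        print(pair)
```

The all-against-all loop is adequate for a single interface; a spatial index would be preferable for whole-structure searches.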
Tyr449 makes side-chain to side-chain hydrogen bond interactions with Asp38 and Gln42, as shown in Fig. 7A. The Asp38 and Gln42 residues adopt a helical conformation in the ACE-2 receptor (as in PDB code: 6LZG). This tyrosine residue is part of the ‘VGGNY’ sequence, one of the two sequences in the human SARS-CoV-2 spike protein RBD previously identified among the structural determinants for recognition of the human ACE-2 receptor (Guruprasad, 2020b). The amide nitrogen of Gly502 forms a main-chain to main-chain hydrogen bond with the carbonyl oxygen of Lys353, as shown in Fig. 7B. Asn487 makes side-chain interactions with the side-chains of Gln24 and Tyr83, as shown in Fig. 7C. Asn487 was mutated in twenty-four proteins in the dataset analysed. While all the sites of interacting residues in the human SARS-CoV-2 spike protein RBD with the human ACE-2 receptor offer attractive targets for inhibitor/drug design, the Gly502 and Tyr449 sites may be more promising considering their relatively low mutation rates during the past two years. Therefore, the sites of interaction between the human SARS-CoV-2 spike protein RBD and the ACE-2 receptor are suggested as potential targets for COVID-19 drug design.

Table 4

Other mutations in human SARS-CoV-2 spike proteins corresponding to the NCBI Accession codes containing the RBD mutations at Y449 and G502.

NCBI Accession codes	Mutations in human SARS-CoV-2 spike proteins comprising the mutations at Y449 and G502
QWN56156.1	V70L, Y449S, A570D, D614G, P681H, T716I, S982A, D1118H
QZJ78555.1	I68T, T95I, Y449S, D614G
QNH88954.1	Y449D, D614G
QJD23270.1	Y449N
UDB67143.1	T95I, Y144S, Y145N, R346K, Y449N, E484K, N501Y, E583D, D614G, P681H, D950N
UCQ96089.1	T19R, Y449H, L452R, T478K, D614G, P681R, D950N
UCK95525.1	P209S, S359T, Y449F, G799D
UAT83124.1	T19R, L452R, T478K, N501T, G502E, V503C, G504K, D614G, P681R, D950N
QTY96446.1	T478K, G502V, D614G, P681H, T732A
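
The observation drawn from Table 4, that no single spike protein carries mutations at all of the ACE-2 interacting residues, can be checked directly from the tabulated data. The sketch below is one illustrative way to do so; the function names are hypothetical, and the mutation strings are copied from the table rows (with the separators in the UCQ96089.1 row read as commas).

```python
import re

# Spike RBD residues within 3.2 angstroms of ACE-2 in PDB 6LZG, as listed in the text
ACE2_CONTACT_SITES = {417, 449, 453, 475, 487, 500, 501, 502}

# Mutations per NCBI accession code, copied from Table 4
TABLE_4 = {
    "QWN56156.1": "V70L, Y449S, A570D, D614G, P681H, T716I, S982A, D1118H",
    "QZJ78555.1": "I68T, T95I, Y449S, D614G",
    "QNH88954.1": "Y449D, D614G",
    "QJD23270.1": "Y449N",
    "UDB67143.1": "T95I, Y144S, Y145N, R346K, Y449N, E484K, N501Y, E583D, "
                  "D614G, P681H, D950N",
    "UCQ96089.1": "T19R, Y449H, L452R, T478K, D614G, P681R, D950N",
    "UCK95525.1": "P209S, S359T, Y449F, G799D",
    "UAT83124.1": "T19R, L452R, T478K, N501T, G502E, V503C, G504K, D614G, "
                  "P681R, D950N",
    "QTY96446.1": "T478K, G502V, D614G, P681H, T732A",
}


def mutated_positions(mutations):
    """Map a string such as 'Y449S, D614G' to the residue numbers {449, 614}."""
    return {int(m.group(1)) for m in re.finditer(r"[A-Z](\d+)[A-Z]", mutations)}


def contact_sites_hit(table, contact_sites):
    """For each accession, list the ACE-2 contact residues that are mutated."""
    return {
        accession: sorted(mutated_positions(mutations) & contact_sites)
        for accession, mutations in table.items()
    }


if __name__ == "__main__":
    hits = contact_sites_hit(TABLE_4, ACE2_CONTACT_SITES)
    for accession, sites in hits.items():
        print(accession, sites)
    # True when no protein is mutated at every contact residue
    print(all(len(sites) < len(ACE2_CONTACT_SITES) for sites in hits.values()))
```

On these nine entries, no protein is mutated at more than two of the eight contact residues: UDB67143.1 combines Y449N with N501Y, and UAT83124.1 combines N501T with G502E.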

Fig. 7

A. Tyr449 side-chain interactions ≤3.2 Å in spike protein RBD with ACE2 receptor (PDB code:6LZG). B. Gly502 main-chain interactions ≤3.2 Å in spike protein RBD with ACE2 receptor (PDB code:6LZG). C. Asn487 side-chain interactions ≤3.2 Å in spike protein RBD with ACE2 receptor (PDB code:6LZG).

The amino acid substitutions observed in human SARS-CoV-2 spike proteins corresponding to the variants classified according to WHO suggest variants that may be examined to delineate whether they are likely to affect virus transmission or the severity of infection, escape host immune recognition, or reduce the efficacy of existing vaccines and monoclonal antibodies. The epitope targets and drug design sites proposed in this work may be examined for further development of drugs and monoclonal antibodies.

Conclusions

The comparison of 303,250 human SARS-CoV-2 spike protein sequences, each comprising 1,273 amino acid residues, with the first reported human SARS-CoV-2 spike protein reference sequence, Wuhan-Hu-1, showed 1,229 distinct mutation sites that represented 4,729 distinct mutations. All these mutations are catalogued in the present work along with the classification of the variants according to the WHO definitions of Variants Being Monitored (VBMs) and Variant Of Concern (VOC). The other amino acid substitutions observed for the variants are also presented. The D614G mutation in the spike protein S1D domain is common to the Alpha, Beta, Gamma, Epsilon, Eta, Iota, Kappa, Zeta, Mu and Delta variants that represent the current VBMs and VOC. The protease cleavage site (amino acid residues 675-692) was associated with the maximum mutation propensity (4.85). Forty-four sites, or nearly 3.45% of the human SARS-CoV-2 spike protein sequence, have not undergone any mutations since the outbreak of the COVID-19 pandemic. The invariant residue sites were associated with ten domains/regions of the SARS-CoV-2 spike protein, with a maximum propensity value (6.24) for the central helix (984-1034), followed by the heptad repeat region (912-983) with propensity value (4.01) and the β-hairpin (1035-1068) with propensity value (3.4). Some of the invariant site residues identified that are exposed on the protein surface may serve as potential epitope targets for monoclonal antibodies or inhibitor design. Eight residues in the human SARS-CoV-2 spike protein RBD that are within 3.2 Å interacting distance of the human ACE-2 receptor, Lys417, Tyr449, Tyr453, Ala475, Asn487, Thr500, Asn501 and Gly502, may serve as potential sites for the design of inhibitors/drugs, with the least mutated residues, Gly502 and Tyr449, being the more promising sites.
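
The distinction drawn above between total mutations, distinct mutations and distinct mutation sites can be made concrete with a small position-by-position comparison. The sketch below is illustrative and is not the pipeline used in this work; it assumes the sequences are already aligned and of equal length, and the six-residue reference and variant sequences are hypothetical toy data. With the reported figures, the same bookkeeping gives 1,273 - 1,229 = 44 invariant sites.

```python
def catalogue_mutations(reference, sequences):
    """Compare equal-length sequences to a reference, position by position."""
    total = 0
    distinct = set()  # substitutions written as, e.g., "D4G"
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("sketch assumes pre-aligned, equal-length sequences")
        for position, (ref_aa, aa) in enumerate(zip(reference, seq), start=1):
            if aa != ref_aa:
                total += 1
                distinct.add(f"{ref_aa}{position}{aa}")
    # Several distinct substitutions may share one site (e.g. D4G and D4N)
    sites = {int(mutation[1:-1]) for mutation in distinct}
    return {
        "total_mutations": total,
        "distinct_mutations": len(distinct),
        "mutation_sites": len(sites),
        "invariant_sites": len(reference) - len(sites),
    }


# Hypothetical toy data, not real spike protein sequences
TOY_REFERENCE = "MFVDLY"
TOY_SEQUENCES = ["MFVGLY", "MFVGLY", "MFVNLY", "MFVDLF"]

if __name__ == "__main__":
    print(catalogue_mutations(TOY_REFERENCE, TOY_SEQUENCES))
```

In the toy data, position 4 is mutated three times but contributes only two distinct mutations and one mutation site, mirroring how 1,269,629 observed mutations reduce to 4,729 distinct mutations at 1,229 sites.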

Funding

None.

CRediT authorship contribution statement

Kunchur Guruprasad: Formal analysis, Writing – original draft.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

25 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling.

Authors: N Guex; M C Peitsch
Journal: Electrophoresis Date: 1997-12 Impact factor: 3.535

3. Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins.

Authors: Tommy Tsan-Yuk Lam; Na Jia; Ya-Wei Zhang; Marcus Ho-Hin Shum; Jia-Fu Jiang; Yi-Gang Tong; Hua-Chen Zhu; Yong-Xia Shi; Xue-Bing Ni; Yun-Shi Liao; Wen-Juan Li; Bao-Gui Jiang; Wei Wei; Ting-Ting Yuan; Kui Zheng; Xiao-Ming Cui; Jie Li; Guang-Qian Pei; Xin Qiang; William Yiu-Man Cheung; Lian-Feng Li; Fang-Fang Sun; Si Qin; Ji-Cheng Huang; Gabriel M Leung; Edward C Holmes; Yan-Ling Hu; Yi Guan; Wu-Chun Cao
Journal: Nature Date: 2020-03-26 Impact factor: 49.962

4. Angiotensin-converting enzyme 2 (ACE2) as a SARS-CoV-2 receptor: molecular mechanisms and potential therapeutic target.

Authors: Haibo Zhang; Josef M Penninger; Yimin Li; Nanshan Zhong; Arthur S Slutsky
Journal: Intensive Care Med Date: 2020-03-03 Impact factor: 17.440

5. SARS-CoV-2 variant B.1.1.7 is susceptible to neutralizing antibodies elicited by ancestral spike vaccines.

Authors: Xiaoying Shen; Haili Tang; Charlene McDanal; Kshitij Wagh; William Fischer; James Theiler; Hyejin Yoon; Dapeng Li; Barton F Haynes; Kevin O Sanders; Sandrasegaram Gnanakaran; Nick Hengartner; Rolando Pajon; Gale Smith; Gregory M Glenn; Bette Korber; David C Montefiori
Journal: Cell Host Microbe Date: 2021-03-05 Impact factor: 31.316

6. Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein.

Authors: Alexandra C Walls; Young-Jun Park; M Alejandra Tortorici; Abigail Wall; Andrew T McGuire; David Veesler
Journal: Cell Date: 2020-03-09 Impact factor: 41.582

7. Evolutionary relationships and sequence-structure determinants in human SARS coronavirus-2 spike proteins for host receptor recognition.

Authors: Lalitha Guruprasad
Journal: Proteins Date: 2020-07-04

8. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus.

Authors: Bette Korber; Will M Fischer; Sandrasegaram Gnanakaran; Hyejin Yoon; James Theiler; Werner Abfalterer; Nick Hengartner; Elena E Giorgi; Tanmoy Bhattacharya; Brian Foley; Kathryn M Hastie; Matthew D Parker; David G Partridge; Cariad M Evans; Timothy M Freeman; Thushan I de Silva; Charlene McDanal; Lautaro G Perez; Haili Tang; Alex Moon-Walker; Sean P Whelan; Celia C LaBranche; Erica O Saphire; David C Montefiori
Journal: Cell Date: 2020-07-03 Impact factor: 66.850

9. Pangolins Harbor SARS-CoV-2-Related Coronaviruses.

Authors: Guan-Zhu Han
Journal: Trends Microbiol Date: 2020-04-06 Impact factor: 18.230
