Comprehensive annotations of the mutational spectra of SARS-CoV-2 spike protein: a fast and accurate pipeline.

Mohammad Shaminur Rahman¹, Mohammad Rafiul Islam¹, Mohammad Nazmul Hoque¹,², Abu Sayed Mohammad Rubayet Ul Alam³, Masuda Akther¹, Joynob Akter Puspo¹, Salma Akter¹,⁴, Azraf Anwar⁵, Munawar Sultana¹, Mohammad Anwar Hossain¹.

Abstract

SARS-CoV-2, which has infected millions of people, is evolving at an unprecedented rate and demands an advanced, dedicated analytic pipeline to capture its mutational spectra. To explore mutations and deletions in the spike (S) protein, the most-discussed protein of SARS-CoV-2, we comprehensively analyzed 35,750 complete S protein-coding sequences through a custom Python-based pipeline. This dataset, collected from GISAID up to 24 June 2020, covered six continents and five major climate zones. We identified 27,801 mutated strains (77.77% of sequences) relative to the Wuhan-Hu-1 reference, of which 84.80% were mutated at only a single amino acid (aa). An outlier strain (EPI_ISL_463893) from Bosnia and Herzegovina possessed six aa substitutions. We also identified 11 residues with high aa mutation frequency, each showing four types of aa variation. The infamous D614G variant has spread worldwide with ever-rising dominance across regions with different climatic conditions; alongside it, the L5F and D936Y mutants were documented throughout all regions and all climate zones, respectively. We also found 988 unique aa substitutions spanning 660 residues, which differed significantly among continents (p = .003) and climatic zones (p = .021) as inferred with the Kruskal-Wallis test. In addition, 17 in-frame deletions at four sites adjacent to the receptor-binding domain were identified that may affect attenuation. This study provides a fast and accurate pipeline for identifying mutations and deletions in large datasets of coding as well as non-coding sequences, as evidenced by the representative analysis of existing S protein data. By using separate multiple sequence alignments, removing ambiguous sequences and sequences with in-frame stop codons, and utilizing pairwise alignment, this method can derive both synonymous and non-synonymous mutations (strain_ID reference aa:mutation position:strain aa). 
We suggest that the pipeline will aid in the evolutionary surveillance of any SARS-CoV-2 encoded proteins and will prove to be crucial in tracking the ever-increasing variation of many other divergent RNA viruses in the future. The code is available at https://github.com/SShaminur/Mutation-Analysis.

Keywords: Climate; Geography; Mutations; SARS-CoV-2; Spike (S) protein | COVID-19

Year: 2020 PMID: 32954666 PMCID: PMC7646266 DOI: 10.1111/tbed.13834

Source DB: PubMed Journal: Transbound Emerg Dis ISSN: 1865-1674 Impact factor: 4.521

INTRODUCTION

Mutations in viral genomes serve as the building blocks of viral evolution and remain the main source of evolutionary novelty (Baer, 2008; Duffy, 2018). However, mutations in viral genomes are not restricted to replication, since they can also result from spontaneous nucleic acid damage over time in different host populations or from editing of the genetic material. A large portion of mutations, whether at the nucleotide (nt) level and/or the amino acid (aa) level, are harmful (Loewe & Hill, 2010). RNA viruses like SARS‐CoV‐2 generally have higher mutation rates; however, only a few of these mutations are correlated with differential virulence, evolving ability, and traits considered beneficial for viruses (Duffy, 2018; Islam et al., 2020). The inherently high mutation rate of SARS‐CoV‐2 has already produced many descendants of the original Wuhan strain, which complicates its genotyping. The ability of the structural proteins, especially the spike protein, to undergo rapid changes in different strains of SARS‐CoV‐2 has enabled their genomes to emerge in novel hosts, escape vaccine‐induced immunity and evolve in diverse geo‐climatic conditions (Duffy, 2018; Islam et al., 2020; Loewe & Hill, 2010). Moreover, spontaneous mutation is a key parameter in modelling the genetic structure and evolution of populations (Drake & Holland, 1999). Therefore, investigating the increased rate of non‐synonymous mutations in SARS‐CoV‐2 genomes could be an important tool for assessing the genetic health of the populations. SARS‐CoV‐2 comprises four major structural proteins: spike (S) glycoprotein, envelope (E) protein, membrane (M) protein and nucleocapsid (N) protein (Ahmed et al., 2020; Rahman et al., 2020; Wu et al., 2020). 
The entry of SARS‐CoV‐2 into host cells is mediated by the transmembrane S protein, which consists of two functional subunits responsible for binding to the host cell receptor (S1 subunit) and for fusing the viral and cellular membranes (S2 subunit) (Walls et al., 2020). The high antigenicity and surface exposure of the S protein facilitate the attachment and entry of viral particles into host cells through the host angiotensin‐converting enzyme 2 (ACE2) receptor (Grant et al., 2020; Shang et al., 2020; Zhou et al., 2019). Therefore, the spike contains the highest variation and determines, to some extent, the viral host range (Coutard et al., 2020; Wu et al., 2020). Furthermore, the S protein is the main target of neutralizing antibodies (Abs) upon infection and is thus one of the most important structures for therapeutics and vaccine design (Rahman et al., 2020; Walls et al., 2020). The continuing rapid transmission and global spread of COVID‐19 have raised intriguing questions regarding the evolution and adaptation of SARS‐CoV‐2 in diverse geographic and climatic conditions driven by non‐synonymous mutations, deletions and/or replacements (Bal et al., 2020; Islam et al., 2020; Pachetti et al., 2020). The capability of different SARS‐CoV‐2 strains to adapt swiftly to diverse environments could be linked with their geographic distributions. Though not yet well studied, evidence suggests that the transmission of SARS‐CoV‐2 infections and the per‐day mortality rate from this infection are positively associated with weather conditions and the diurnal temperature range (DTR) (Su et al., 2016; Islam et al., 2020). The exact effect of geo‐climatic conditions on SARS‐CoV‐2 is unknown, but it is worth keeping in mind that this novel disease originated in wildlife before spreading to humans (Harvey, 2020). 
Therefore, genomic mutation analysis of SARS‐CoV‐2 strains, integrated with geographic and climatic data, would provide a fuller understanding of the origin, dispersal and dynamics of the evolving SARS‐CoV‐2 virus. Although several reports predicted possible adaptations at the nucleotide and aa level, along with structural heterogeneity in viral proteins, especially in the S protein (Armijos‐Jaramillo et al., 2020; Islam et al., 2020; Phan, 2020; Sardar et al., 2020), most of these studies were carried out on a few complete representative genomes from a limited geographic area. As the number of genomes is increasing day by day, regular in‐house monitoring of crucial components such as the S protein is urgently necessary to understand the genomic basis and evolution of the diagnostic RT‐PCR primer. There are a few pipelines (Yin, 2020) and websites (https://mendel.bii.astar.edu.sg/METHODS/corona/beta/MUTATIONS/hCoV19_Human_2019_WuhanWIV04/hCoV-19_Spike_new_mutations_table.html) in GISAID where aa changes or substitutions can be observed. To provide an alternative tool with a wider range of functions, we present an easy, rapid pipeline that assists in aligning large volumes of viral genomes, removes low‐quality sequences and sequences with in‐frame stop codons, and provides in‐house non‐synonymous mutation analysis of large volumes of sequences while requiring minimal knowledge of the command line. The tool can perform this analysis for any other protein as required. This study aimed to investigate the aa mutational spectra of the S protein using this novel methodology in 35,750 complete genome sequences of SARS‐CoV‐2 belonging to 135 countries and/or regions and five climatic zones around the world, retrieved from the Global Initiative on Sharing All Influenza Data (GISAID) (https://www.gisaid.org/) up to 24 June 2020 (Data S1).

MATERIALS AND METHODS

Genomic data collection and processing

To decipher the genetic variations of the S glycoprotein, we retrieved 53,981 complete (or near‐complete) genome sequences of SARS‐CoV‐2 available at the Global Initiative on Sharing All Influenza Data (GISAID) (https://www.gisaid.org/) up to 24 June 2020. These sequences belonged to infected patients from 135 countries and/or regions across six continents (Data S1). Using pyfasta (https://github.com/brentp/pyfasta), we split the complete set of genomes into six separate files of around 8,900 sequences each. We aligned each file on the MAFFT online server (maximum limit 10,000 sequences) (https://mafft.cbrc.jp/alignment/server/add_fragments.html?frommanual) using default parameters (Katoh et al., 2002). The complete genome sequence of the SARS‐CoV‐2 Wuhan‐Hu‐1 strain (Accession NC_045512, Version NC_045512.2) was used as the reference genome.
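The splitting step can be sketched in plain Python. This is a minimal illustration only: the study used pyfasta, and the helper names `read_fasta` and `split_records` are hypothetical. It shows how 53,981 records divide into six chunks that each stay below the 10,000-sequence server limit.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records: Sequence, n_files: int) -> List[list]:
    """Split records into at most n_files consecutive chunks of near-equal size."""
    records = list(records)
    if not records:
        return []
    size = -(-len(records) // n_files)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


if __name__ == "__main__":
    # Stand-in for 53,981 genome records (integers instead of real sequences).
    chunks = split_records(range(53981), 6)
    print([len(c) for c in chunks])
```

Each chunk would then be written to its own file and aligned separately against the reference.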

Mutation frequency analysis

MEGA 7 was used to extract the spike protein‐coding region of SARS‐CoV‐2 from the multiple sequence alignment (Sudhir Kumar et al., 2016). Sequence cleaner (https://github.com/metageni/Sequence-Cleaner), with the parameters minimum length (m = 3,822), percentage N (mn = 0), keep_all_duplicates and remove_ambiguous, was employed to remove all ambiguous and low‐quality sequences. We used the SeqKit toolkit (seqkit grep -s -p "-" in.fa > out.fa) to capture gap‐containing strains for deletion analysis (Shen et al., 2016). Sequences containing internal stop codons were removed using SEquence DAtaset builder (SEDA; https://www.sing-group.org/seda/). Amino acid mutation analysis was done with a Biopython program using pairwise alignment (https://github.com/SShaminur/Mutation-Analysis). The custom Venn diagram server (http://bioinformatics.psb.ugent.be/webtools/Venn/) was used to draw the Venn diagrams and visualize the data. Swiss‐Model, a structure homology‐modelling server (https://swissmodel.expasy.org/), was used to predict the 3D structure (template, PDB ID: 6VSB) of the S protein of the reference genome, and the structure was visualized in PyMOL (DeLano, 2002; Rahman et al., 2020; Waterhouse et al., 2018). Furthermore, we divided the S glycoprotein mutation data of SARS‐CoV‐2 according to geographic origin into six continents (Europe, Asia, North America, South America, Africa and Australia) and five related climatic zones (temperate, tropical, diverse, dry and continental) (Kissler, Tedijanto, Goldstein, & Yonatan, 2020). To estimate the case fatality (mortality) rates of SARS‐CoV‐2 infections, we collected information on total infected cases and total reported deaths in these countries from the World Health Organization (WHO) COVID‐19 Reports up to 24 June 2020 (WHO Reports, 2020).
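The three filters described above (length and ambiguity cleaning, gap capture, internal stop codon removal) can be approximated in a few lines of plain Python. This is a hedged sketch, not the behaviour of Sequence cleaner, SeqKit or SEDA themselves; the function names are hypothetical, and the code assumes that an "ambiguous" sequence is one containing any character other than A, C, G or T.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
UNAMBIGUOUS = set("ACGT")


def passes_cleaning(seq: str, min_length: int = 3822) -> bool:
    """Keep sequences that are long enough and contain only A, C, G, T."""
    seq = seq.upper()
    return len(seq) >= min_length and set(seq) <= UNAMBIGUOUS


def has_gap(aligned_seq: str) -> bool:
    """Flag aligned sequences with a gap character, as candidates for deletion analysis."""
    return "-" in aligned_seq


def has_internal_stop(cds: str) -> bool:
    """Return True if a stop codon occurs in frame before the final codon."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])


if __name__ == "__main__":
    # Toy sequences with a reduced minimum length, for illustration only.
    print(passes_cleaning("ATGCCC", min_length=6))   # clean
    print(passes_cleaning("ATGNNN", min_length=6))   # ambiguous bases
    print(has_gap("ATG---CCC"))
    print(has_internal_stop("ATGTAAGGGTGA"))
```

The default minimum length of 3,822 nt mirrors the `m` parameter quoted in the text; the terminal codon is deliberately excluded from the stop codon check.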

Pipeline validations

An overview of the methods is given in Figure 1. The number of SARS‐CoV‐2 genomes in the Global Initiative on Sharing All Influenza Data (GISAID) is increasing very rapidly, but not all genomes are complete or of high quality. Non‐synonymous mutation analysis of a particularly crucial part of the virus, such as the S protein or another structural protein, therefore gives statistically more robust insights than considering the complete genome of SARS‐CoV‐2. Of the total S protein sequences, Sequence cleaner removed 33.77% as low‐quality or ambiguous. Among the remaining cleaned sequences (66.23%), we found ten sequences containing in‐frame stop codons, which were removed using SEDA (https://www.sing-group.org/seda/manual/operations). The SeqKit toolkit was used to capture gap‐containing sequences and identified around 453 of them; after careful checking, 103 strains contained in‐frame deletions. SNP‐sites is a very efficient tool for nucleotide variation detection that supports formats such as multi‐FASTA alignment, variant call format (VCF) and relaxed PHYLIP format (Page et al., 2016), but it is dedicated to nucleotides. Snippy (Seemann, 2015) is another tool that can detect both nucleotide and protein variation, but a large dataset with ambiguous sequences requires separate processing to ensure accurate results. This pipeline reports the non‐synonymous mutations in a file format (Strain_ID Reference_aa:Mutation_Position:Strain_aa) that assists downstream analyses such as unique mutations, unique mutated positions, mutational frequency and the number of mutations per strain (Figure 2). Moreover, the tool can also be applied to synonymous mutation analysis of large datasets. For deletion analysis, the pipeline reduced the number of sequences to inspect (just 453 of 53,981 sequences). 
Details of the current pipeline and code are deposited on GitHub (https://github.com/SShaminur/Mutation-Analysis).
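The output format described above can be illustrated with a short sketch. It is not the code from the repository: the function `call_substitutions` and the strain IDs are hypothetical, and it assumes the reference and strain protein sequences have already been aligned to equal length with no gaps in the reference, so that the alignment column equals the reference position.

```python
from typing import List


def call_substitutions(strain_id: str, ref_aa: str, strain_aa: str) -> List[str]:
    """Report substitutions as 'Strain_ID Reference_aa:Mutation_Position:Strain_aa'."""
    if len(ref_aa) != len(strain_aa):
        raise ValueError("sequences must be aligned to the same length")
    calls = []
    for position, (ref, alt) in enumerate(zip(ref_aa, strain_aa), start=1):
        # Gaps are left to the separate deletion analysis.
        if ref != alt and "-" not in (ref, alt):
            calls.append(f"{strain_id} {ref}:{position}:{alt}")
    return calls


if __name__ == "__main__":
    # Toy six-residue sequences, not real spike data.
    reference = "MFVFLD"
    for line in call_substitutions("toy_strain_1", reference, "MFVFFG"):
        print(line)
```

Because every call is one whitespace-separated line, downstream counts (unique mutations, mutated positions, mutations per strain) reduce to simple grouping operations on this output.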

Figure 1

Workflow of the pipeline used for non‐synonymous mutation analyses in this study. File splitting is needed if the number of sequences is more than 10,000. Through these methods, nucleotide mutations can also be calculated. Here, MSA: multiple sequence alignment, and ORFs: open reading frames.

Figure 2

Overview of the input and results output representing changes in aa position (white background) in different strains of SARS‐CoV‐2 with regard to the reference genome.

Statistical analysis

The Wu–Kabat variability coefficient was employed to calculate aa position variability with regard to evolutionary adaptation (Garcia‐Boronat et al., 2008; Kabat et al., 1977). The variability coefficient was calculated using the following formula: $\text{Variability} = \frac{N \times k}{n}$, where N = total number of sequences in the alignment, k = number of different aa at a given position, and n = frequency of the most common aa at that position. We used Microsoft Excel 2016 to calculate frequencies, percentages and the Wu–Kabat variability coefficient (using the above formula), and for overall data management (David, 2017). The Wu–Kabat variability coefficient plot was drawn in RStudio with the ggplot2 package (Wickham, 2011). The frequency lolliplot was also drawn in RStudio with the trackViewer package (https://bioconductor.org/packages/release/bioc/vignettes/trackViewer/inst/doc/trackViewer.htm) (Ou et al., 2020a,2020b). To assess the morbidity and case fatality rates, and the association between the S protein mutational spectra and case fatality rates, we applied the non‐parametric Kruskal–Wallis rank sum test (Hoque et al., 2019) using IBM SPSS (SPSS, version 23.0, IBM Corp., NY USA).
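As an illustration of the coefficient, the following sketch computes N × k / n for each column of a small alignment. The study performed this calculation in Microsoft Excel; the Python functions and the toy alignment here are hypothetical stand-ins.

```python
from collections import Counter
from typing import List, Sequence


def wu_kabat(column: Sequence[str]) -> float:
    """Wu-Kabat variability of one alignment column: N * k / n."""
    counts = Counter(column)
    total = len(column)                       # N: number of sequences
    distinct = len(counts)                    # k: different aa at this position
    most_common = counts.most_common(1)[0][1]  # n: frequency of the most common aa
    return total * distinct / most_common


def variability_profile(alignment: List[str]) -> List[float]:
    """Coefficient for every position of an equal-length protein alignment."""
    return [wu_kabat(column) for column in zip(*alignment)]


if __name__ == "__main__":
    toy_alignment = ["MDL", "MGL", "MGF", "MGL"]  # four toy sequences
    print(variability_profile(toy_alignment))
```

A fully conserved column gives exactly 1 (N × 1 / N), which is why a coefficient of 1 is read as invariability in the Results.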

RESULTS AND DISCUSSIONS

Geo‐climatic distribution of strains

Trimming of the low‐quality, ambiguous and non‐human host RNA sequences resulted in 35,750 (66.23%) cleaned and full‐length S protein sequences (Data S1). These sequences belonged to 135 countries and/or regions from six continents (Europe, Asia, North America, South America, Africa and Australia) and five major climatic zones (temperate, tropical, diverse, dry and continental) around the world (Data S1). European countries and/or regions had the highest percentage (58.90%) of S protein sequences, followed by North American (25.78%), Asian (9.34%), Australian (3.61%), South American (1.21%) and African (1.18%) countries or regions. On the other hand, the temperate climatic zone covered the majority of these S protein sequences (60.18%), followed by diverse (33.08%), continental (3.25%), tropical (2.81%) and dry (0.69%) climatic conditions (Data S1). We selected the complete genome sequence SARS‐CoV‐2 Wuhan‐Hu‐1 strain (Accession NC_045512, Version NC_045512.2) as a reference genome. Through non‐synonymous mutations analysis, we found 27,801 (77.77%) mutated strains of the SARS‐CoV‐2 in the cleaned sequences (n = 35,750). Furthermore, country or region‐specific aa change patterns revealed the highest number of mutated SARS‐CoV‐2 strains in England (7,067) followed by USA (6,501), Wales (3,002), Scotland (1,463), Netherlands (1,194), Australia (681), Belgium (596) and Denmark (582) (Data S1).

Evolutionary footprint in spike

Our mutational analyses revealed a total of 988 unique amino acid (aa) substitutions distributed across 660 positions of the SARS‐CoV‐2 S protein. Among these positions, 250 showed two or more aa variations (Figure 3a, Table 1, Data S2). The primary structure of the S protein is 1,274 aa long; of these, 51.81% of aa positions (n = 660) underwent aa‐level evolution worldwide. The position‐specific aa variability of the S protein was visualized in a Wu–Kabat protein variability plot (Figure 3b). The variability analysis identified 19 positions with a Wu–Kabat variability coefficient >4, indicating high variability at these positions. However, 614 (48.19%) positions had a coefficient of 1, indicating invariability of these positions compared to the reference strain of SARS‐CoV‐2 (Figure 3b). Remarkably, we found eleven highly variable sites (positions 32, 142, 146, 215, 261, 477, 529, 570, 622, 778, 791, 1146 and 1162) showing four types of aa variations in a single position (Table 1). We also found that positions 52, 185 and 410 in the S glycoprotein underwent 3, 2 and 1 aa substitutions, respectively (Table 1, Data S2). Notably, position 614 showed two variants: the substitution D614G (aspartic acid > glycine) was found in ~74.82% (n = 26,749) of the cleaned sequences (~96.22% of the mutated sequences), whereas D614N (aspartic acid > asparagine) was observed in only four strains from England and Wales (EPI_ISL_439400, EPI_ISL_443658, EPI_ISL_445498 and EPI_ISL_472913). The D614G variant of the S protein has overtaken the wild‐type variant from China since its first appearance in Germany on 28 January 2020 (Comandatore et al., 2020; Eaaswarkhanth et al., 2020; Kim et al., 2020; Korber et al., 2020; Trucchi et al., 2020). Moreover, a recurrent pattern of increasing G614 frequency at multiple geographic levels (national, regional and municipal) has also been reported through dynamic tracking. 
The shift might occur even in local epidemics where the original D614 form was well established prior to introduction of the G614 variant (Korber et al., 2020).

Figure 3

Table 1

Amino acid variations in the S protein of the SARS‐CoV‐2 according to their position

Position in S	Number of variations	Name of amino acid
19	3	T19P, T19I, T19S
21	3	R21I, R21T, R21K
22	3	T22N, T22I, T22A
26	3	P26L, P26S, P26R
27	3	A27V, A27T, A27S
32	4	F32L, F32Y, F32I, F32V
72	3	G72E, G72W, G72R
75	3	G75D, G75V, G75R
80	3	D80N, D80Y, D80A
97	3	K97E, K97N, K97R
102	3	R102S, R102I, R102G
142	4	G142D, G142A, G142V, G142S
146	4	H146Q, H146N, H146Y, H146R
148	3	N148Y, N148K, N148S
153	3	M153T, M153I, M153V
183	3	Q183H, Q183R, Q183L
215	4	D215Y, D215H, D215G, D215N
218	3	Q218R, Q218E, Q218L
222	3	A222V, A222S, A222P
239	3	Q239K, Q239R, Q239H
246	3	R246I, R246K, R246S
247	3	S247R, S247I, S247N
251	3	P251S, P251H, P251L
261	4	G261V, G261S, G261D, G261R
263	3	A263T, A263S, A263V
273	3	R273M, R273K, R273S
354	3	N354D, N354K, N354S
414	3	Q414R, Q414K, Q414P
468	3	I468F, I468T, I468V
477	4	S477I, S477N, S477R, S477G
483	3	V483F, V483I, V483A
529	4	K529M, K529N, K529R, K529E
558	3	K558N, K558Q, K558R
570	4	A570S, A570V, A570D, A570T
615	3	V615I, V615F, V615L
622	4	V622F, V622L, V622I, V622A, A623V
654	3	E654D, E654Q, E654K
675	3	Q675H, Q675R, Q675K
677	3	Q677H, Q677R, Q677Y
681	3	P681H, P681L, P681S
684	3	A684V, A684T, A684S
747	3	T747A, T747I, T747N
750	3	S750N, S750R, S750I
752	3	L752I, L752R, L752F
765	3	R765L, R765H, R765C
772	3	V772L, V772I, V772A
778	4	T778S, T778A, T778N, T778I
780	3	E780D, E780Q, E780V
791	4	T791I, T791A, T791K, T791P
812	3	P812S, P812T, P812L
831	3	A831S, A831V, A831T
836	3	Q836H, Q836P, Q836L
838	3	G838S, G838V, G838D
839	3	D839Y, D839E, D839N
845	3	A845S, A845V, A845D
847	3	R847T, R847I, R847K
870	3	I870S, I870T, I870V
879	3	A879S, A879V, A879T
930	3	A930S, A930V, A930T
1,085	3	G1085R, G1085E, G1085L
1,129	3	V1129L, V1129A, V1129I
1,146	4	D1146Y, D1146H, D1146E, D1146N
1,153	3	D1153A, D1153H, D1153Y
1,162	4	P1162L, P1162T, P1162A, P1162S
1,170	3	S1170T, S1170Y, S1170P

Here, the position(s) where more than 2 variations occurred are represented.
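A table of this kind can be derived from a flat list of substitution labels by grouping on position. The sketch below is illustrative only: the function names are hypothetical, and the input is a small subset of labels taken from Table 1 and the text (T19, F32 and D614), not the full dataset.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def group_by_position(substitutions: Iterable[str]) -> Dict[int, Set[str]]:
    """Map each position to the set of variant residues observed there."""
    by_position = defaultdict(set)
    for label in substitutions:
        match = SUBSTITUTION.match(label)
        if match is None:
            raise ValueError(f"unrecognised substitution label: {label}")
        _ref, position, alt = match.groups()
        by_position[int(position)].add(alt)
    return dict(by_position)


def positions_with_more_than(by_position: Dict[int, Set[str]], minimum: int) -> Dict[int, int]:
    """Positions whose number of distinct variants exceeds `minimum`."""
    return {pos: len(alts) for pos, alts in sorted(by_position.items()) if len(alts) > minimum}


if __name__ == "__main__":
    labels = ["T19P", "T19I", "T19S", "F32L", "F32Y", "F32I", "F32V", "D614G", "D614N"]
    print(positions_with_more_than(group_by_position(labels), 2))
```

With this subset, position 614 drops out because it has only two variants, matching the table's inclusion rule of more than two variations.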

Figure 3. Mutational mapping and Wu–Kabat variability analysis of SARS‐CoV‐2 S protein. (a) Mapping and frequency distribution of recurrent mutations occurring in the S protein of ≥25 SARS‐CoV‐2 strains. Deletion sites of the S protein were also visualized in the lolliplot graph. (b) Wu–Kabat protein variability coefficient plot of SARS‐CoV‐2 S protein. Here, a variability coefficient of 1 indicates conservation, whereas coefficients >1 indicate relative variability of the respective position. The higher the coefficient, the greater the variability or diversity.

A strain from Bosnia and Herzegovina (EPI_ISL_463893) had the highest number of aa substitutions (n = 6), at six positions of the S protein (R246I, L276I, T430A, D614G, S750N, L922V). We also found that 84.8% (n = 23,576) of the mutated sequences carried just a single aa mutation in the S protein. The remaining 13.44%, 1.63%, 0.11% and 0.01% of the mutated sequences contained 2, 3, 4 and 5 aa changes, respectively (Data S2). Moreover, we did not find any non‐synonymous aa mutation in the full‐length S protein from 18 countries and/or regions, namely Anhui, Brunei, Cambodia, Changzhou, Chongqing, Foshan, Ganzhou, Guam, Hefei, Jiangxi, Jingzhou, Jiujiang, Lishui, Nepal, Philippines, Qatar, Yingtan and Yunnan. This indicates homogeneity of the S protein from these countries/regions with the reference sequence from Wuhan, China (Data S1). The RBD region (Watanabe et al., 2020) (aa position: 338–530) showed non‐synonymous mutations at 82 different positions in 516 strains, whereas the S1 and S2 sites showed 362 and 297 positions with recurrent mutations, respectively. 
Moreover, in the furin cleavage site (R685 and S686), we also observed a non‐synonymous mutation (S686G) in a single strain (Russia/Krasnodar‐63401/2020|EPI_ISL_428867|2020‐03‐11) (Data S2). We also found aa substitutions at six positions within the RBD region that are directly involved in binding the ACE2 receptor (Wang et al., 2020; Yuan et al., 2020), including N439K (Scotland, Romania), L455F (England), A475V (USA, Australia), and F456L, Q493L and N501Y (USA) (Data S2). All these mutations were found between March and April at low frequency (N439K had the maximum frequency, in 41 Scottish strains and one Romanian strain), except Q493L, which was found in two USA strains reported in May. Position 493 also varied (Q493R) in an English strain (EPI_ISL_470150) found in April. Furthermore, 18 substitutions at fourteen positions previously reported to interact with anti‐SARS‐CoV‐2 antibody (Yuan et al., 2020) were found in strains from Bangladesh, England, Portugal, Wales, Shanghai, France, USA, Scotland, Russia, Latvia, Netherlands, South Africa, Bosnia and Herzegovina, Belgium and Australia (Data S2) during the period March to May. Discontinuation of these mutants globally may be linked to reduced virus pathogenicity and virulence fitness affecting transmission dynamics. However, the absence of these variants may also result from rejection of low‐ratio variants when generating the final consensus sequences and from insufficient sequence reporting from unusual asymptomatic patients. Moreover, eight glycosylated sites of the S protein underwent aa conversions, including three substitutions in the NTD region (N17K, N74K and N149H), five substitutions at four sites in the S1 region (N17K, N74K, N149H, N603S and N603K) and four mutations in the S2 region (N717T, N1074D, N1158S and N1194S) (Watanabe et al., 2020). 
Furthermore, a total of 50 aa substitutions that introduced asparagine (N) were observed within the S protein of SARS‐CoV‐2, including seven within the RBD region (S359N, K378N, K417N, K458N, S477N, T523N and K529N) (Data S2). These substitutions may alter glycosylation sites and their nature, though this needs further investigation. Overall, the asparagine‐related aa substitutions in the RBD (ACE2‐binding domain) and/or in the S1/S2 domains near the glycosylated sites may affect the glycosylation shield, folding of the S protein, host–pathogen interactions, viral entry and finally immune modulation, thus affecting antibody recognition and viral pathogenicity (Ou et al., 2020a,2020b; Watanabe et al., 2020). These variability profiles may have notable implications for therapeutic and/or prophylactic interventions targeting the S protein of SARS‐CoV‐2.

In‐frame deletions adjacent to the RBD of the S glycoprotein

Besides site‐specific mutations, our analysis revealed 17 in‐frame deletions of varying nucleotide lengths across the SARS‐CoV‐2 S protein sequences originating from different countries worldwide (Table 2, Data S2). Notably, we counted only deletions that occurred in at least two strains at a given position. All of the identified deletions, distributed throughout nucleotide positions 179–2035, fall into four major regions of the S protein: nt positions 179–226 (61–76 aa: NVTWFHAIHVSGTNGT), 413–433 (138–144 aa: DPFFLGVY), 724–732 (241–244 aa: LLAL) and 2021–2035 (675–679 aa: QTQTN). The aa deletions at positions 61–76, 138–144 and 241–244 are near the RBD region. Among them, the deletions at positions 61–76 and 141–144 are surface exposed, whereas 241–244 is situated at the inner surface of the predicted S protein (Figure 4). The deleted aa at positions 675–679 are located in the C‐terminal transmembrane domain of the S protein. Surface‐exposed deletions near the RBD region may have a significant impact on host–pathogen interaction and immune modulation.

Table 2

Deletion sites observed across the S glycoprotein

Nucleotide position	Amino acid position	Deleted amino acid	Countries	Number of strains
179–217	61–73	NVTWFHAIHVSGT	England	1
200–226	68–76	IHVSGTNGT	Taiwan, Malaysia	2
201–224	68–75	IHVSGTNG	Thailand	1
203–208	69–70	HV	Sweden, England, Australia	3
413–421	138–140	DPF	Sweden	1
418–420	140	F	England, Sichuan	3
420–431	141–144	LGVY	England, Iceland, USA, Scotland, Kenya	16
420–422	141	L	England	1
422–430	141–143	LGV	Portugal, England, Iceland, Scotland	4
423–431	142–144	GVY	England, Netherlands	3
428–430	143	V	USA, Belgium	4
428–433	143–144	VY	England	2
429–431	145	Y	England, Canada, Slovenia, Jordan, Netherlands, Saudi_Arabia, Scotland, USA, Spain, Wales, India, Australia	48
724–732	241–243	LLA	China, England, Belgium, Scotland, Netherlands	6
724–726	241	L	USA	2
727–732	243–244	AL	England, Wales, Spain, Sichuan	6
2021–2035	675–679	QTQTN	Taiwan, Malaysia	2

Countries represent the origin of the strains in which the deletions were found. We considered the deletions that occurred in at least two strains at a certain position.
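A quick consistency check on Table 2 is that every deletion should span a multiple of three nucleotides, and that the span divided by three should equal the number of deleted residues. The sketch below applies this to four rows copied from the table; the helper names are hypothetical and coordinates are treated as inclusive.

```python
from typing import List, Tuple


def deletion_length(nt_start: int, nt_end: int) -> int:
    """Number of deleted nucleotides for an inclusive coordinate range."""
    return nt_end - nt_start + 1


def is_in_frame(nt_start: int, nt_end: int) -> bool:
    """A deletion keeps the reading frame when its length is divisible by 3."""
    return deletion_length(nt_start, nt_end) % 3 == 0


# (nt start, nt end, deleted amino acids) for four rows of Table 2.
TABLE_2_ROWS: List[Tuple[int, int, str]] = [
    (179, 217, "NVTWFHAIHVSGT"),
    (203, 208, "HV"),
    (420, 431, "LGVY"),
    (2021, 2035, "QTQTN"),
]

if __name__ == "__main__":
    for start, end, residues in TABLE_2_ROWS:
        codons = deletion_length(start, end) // 3
        print(start, end, is_in_frame(start, end), codons == len(residues))
```

Only the lengths are checked here; mapping nucleotide coordinates to residue numbers depends on the coordinate convention used in the alignment and is not attempted.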

Figure 4

Structural visualization of S protein deletion sites. The four aa deleted positions (61–76, 138–144, 241–244, and 675–679) in the S protein of the reference genome, SARS‐CoV‐2 Wuhan‐Hu‐1 strain (Accession NC_045512, Version NC_045512.2). The positions are visualized in the tertiary (3D) structure of S protein using PyMOL. The smudge, cyan and light orange colours represent the A, B and C chains of SARS‐CoV‐2 spike protein, respectively. Blue, yellow, magenta and red colours represent the aa deletion positions 61–76, 138–144, 241–244, and 675–679, respectively.

Among the deletions, the nucleotide region at positions 418–433 (aa positions 140–144) showed frequent overlapping deletions among strains from multiple countries (Table 2). Notably, a single‐aa in‐frame deletion at nucleotide positions 429–431 (aa position 145) had the highest frequency, occurring in 48 strains from multiple countries and/or regions including Australia, England, Canada, Slovenia, Jordan, Netherlands, Saudi_Arabia, Scotland, USA, Spain, Wales and India. A strain from Taiwan (EPI_ISL_444275) showed two coevolving deletions at nt positions 200–226 (68–76 aa: IHVSGTNGT) and nt positions 2021–2035 (675–679 aa: QTQTN). Moreover, two deletions at nt positions 418–420 (140 aa: F) and 727–732 (243–244 aa: AL) coevolved in a Sichuan strain (EPI_ISL_451369). No other strain had such coevolving deletions, thereby indirectly indicating a negative impact of the deletions on virus fitness and human‐to‐human transmissibility. 
Notably, a 5‐aa deletion (675–679 aa: QTQTN) upstream of the polybasic cleavage site of S1–S2 and a 21‐nt deletion at 23596–23617 (aa: NSPRRAR) that includes the polybasic cleavage site, found in clinical samples and a cell‐isolated virus strain, likely benefit SARS‐CoV‐2 replication or infection in vitro and are under strong purifying selection in vivo (Liu et al., 2020). Moreover, attenuated SARS‐CoV‐2 variants with 15–30‐bp deletions (Del‐mut) at the S1/S2 junction were reported to show less virulence in an animal model (Lau et al., 2020). These deletions may affect viral adaptation to humans, virus–host interactions during infection, attenuation, pathogenicity and immune modulation by potentially influencing the tertiary structures and functions of the associated proteins (Phan, 2020). However, further studies are required to clarify the mechanisms and functional implications of these deletions in the SARS‐CoV‐2 S glycoprotein. The deletion mutations identified in this study should also be considered in current vaccine development.

Geo‐climatic scenario of amino acid heterogeneity in the spike protein of SARS‐CoV‐2 and associated disease severity

To assess geo‐climatic effects on aa changes in the S protein of SARS‐CoV‐2, we determined the mutated residue positions and the total number of mutations in S protein sequences from 135 countries and/or territories and five climatic zones worldwide. We identified 988 unique aa replacements across 660 positions along the S protein, which differed significantly among continents (p = .003, Kruskal–Wallis test) and climatic zones (p = .021, Kruskal–Wallis test). The frequency of aa changes in the S protein was substantially higher in the SARS‐CoV‐2 genome sequences from Europe (62.02%), followed by North America (25.50%), Asia (6.83%), Australia (2.89%), South America (1.41%) and Africa (1.35%) (Figure 5a, Data S1). Among these replacements, those at positions 5 (L5F) and 614 (D614G) were common to Asia, Europe, North America, South America, Africa and Australia (Figure 5b). Moreover, 408, 127, 139, 17, 10 and 8 unique aa replacements (mutations found only once in a sequence) were identified in the S protein sequences of Europe, Asia, North America, Australia, South America and Africa, respectively (Figure 5b, Data S3). In addition to unique aa mutations, 244, 146, 194, 61, 19 and 23 accessory aa replacements (mutations shared by at least two continents) were also found in the S protein of SARS‐CoV‐2 genomes sequenced from Europe, Asia, North America, Australia, South America and Africa, respectively (Figure 5b, Data S3). The significantly higher numbers of unique mutations in European (p = .0121, Kruskal–Wallis test), Asian (p = .0177, Kruskal–Wallis test) and American (p = .0391, Kruskal–Wallis test) sequences point to a tendency of the virus towards geographic clustering. Further phylogenetic study targeting these unique and accessory mutations may lead to a better understanding of global phylodynamics and thereby guide regional control strategies for the COVID‐19 pandemic.
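The core/accessory/unique split described above amounts to counting, for each mutation, the number of continents in which it occurs. The following is a minimal sketch of that bookkeeping, not the authors' pipeline code. It assumes that "core" means present in every region, "unique" means present in exactly one region, and "accessory" means shared by at least two but not all regions. Under that reading the counts are additive: for Asia, 127 unique + 146 accessory + 2 core = 275, the total given in the Figure 5 caption. The function name and the toy region and mutation labels (other than L5F and D614G) are hypothetical.

```python
from collections import defaultdict


def classify_by_region(mutations_by_region):
    """Split each region's mutation set into core, accessory and unique.

    mutations_by_region maps a region name to a set of mutation labels
    such as "D614G". Assumed definitions (not confirmed by the paper):
      core      - present in every region
      unique    - present in exactly one region
      accessory - present in two or more regions, but not in all
    """
    # For every mutation, record the regions in which it was observed.
    regions_with = defaultdict(set)
    for region, mutations in mutations_by_region.items():
        for mutation in mutations:
            regions_with[mutation].add(region)

    n_regions = len(mutations_by_region)
    result = {}
    for region, mutations in mutations_by_region.items():
        core = {m for m in mutations if len(regions_with[m]) == n_regions}
        unique = {m for m in mutations if len(regions_with[m]) == 1}
        accessory = set(mutations) - core - unique
        result[region] = {"core": core, "accessory": accessory, "unique": unique}
    return result


# Toy input: region names and the m1..m4 labels are placeholders, not data
# from the study.
toy = {
    "RegionA": {"D614G", "L5F", "m1", "m2"},
    "RegionB": {"D614G", "L5F", "m2", "m3"},
    "RegionC": {"D614G", "L5F", "m4"},
}

for region, classes in classify_by_region(toy).items():
    print(region, {name: sorted(members) for name, members in classes.items()})
```

Using sets keeps the classification independent of how often a mutation occurs within a region; per-region frequencies would need a separate counter.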

Figure 5

The frequency spectra of aa mutations in the S protein of SARS‐CoV‐2. (a) Number of sequences, number of mutated sequences and number of aa mutations with respect to continent and climate region. The numbers of aa mutations in Africa, Asia, Australia, Europe, North America and South America were 33, 275, 80, 660, 335 and 31, respectively, and those in continental, diverse, dry, temperate and tropical climatic conditions were 107, 472, 17, 686 and 78, respectively. The aa mutations are represented according to (b) geographic areas and (c) climate zones. We found two core shared aa mutations at residue positions 5 (L5F) and 614 (D614G) in Asia, Europe, Africa, Australia, North America and South America, and two core shared mutations at residue positions 614 (D614G) and 936 (D936Y) in continental, diverse, dry, tropical and temperate climatic conditions. In both (b) and (c), the middle brown circles represent the frequency of aa substitutions shared by all variables, and the frequency of aa substitutions shared by at least two continents/climate zones is shown in the white circle. The white outer ribbons represent unique aa mutations in each individual region and climate zone.

This study also explores the non‐synonymous mutations in the S protein of SARS‐CoV‐2 genomes across five different climatic conditions worldwide. We found significant (p = .021, Kruskal–Wallis test) variation in aa mutation patterns among climatic conditions, with the highest frequency of unique aa mutations in the temperate region (p = .017, Kruskal–Wallis test). Our analysis showed that only two core aa substitutions, at positions 614 (D614G) and 936 (D936Y), were shared across all the climatic zones (Figure 5c). Furthermore, 426, 231, 29, 29 and 1 unique aa replacements were found in the S protein sequences of the temperate, diverse, tropical, continental and dry climatic conditions, respectively. In addition, we identified 252, 239, 47, 76 and 14 shared aa replacements in temperate, diverse, tropical, continental and dry climatic conditions, respectively, where non‐synonymous mutations occurred in at least two climatic zones (Figure 5c, Data S3). RNA viruses like SARS‐CoV‐2 might have remarkable capabilities to adapt to new environments and confront the different selective pressures they encounter (Watanabe et al., 2020). The geographic and climatic patterns of mutational evolution of the SARS‐CoV‐2 S protein are visualized in Figure 6.
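The Kruskal–Wallis comparisons reported above pool all observations, rank them jointly and compare the rank sums of the groups. The following is a minimal sketch of the tie‐corrected H statistic in plain Python, not the authors' code. The text does not state which per‐group values were compared, so the input numbers below are made‐up placeholders and the function names are hypothetical. A p‐value would come from a chi‐square distribution with (number of groups − 1) degrees of freedom; `scipy.stats.kruskal` returns both the statistic and the p‐value.

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks of values, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of non-empty numeric groups."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of (rank sum squared / group size) over the groups.
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over groups of tied values.
    tie_counts = {}
    for value in pooled:
        tie_counts[value] = tie_counts.get(value, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in tie_counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


# Placeholder observations for three hypothetical groups (not study data).
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(f"H = {kruskal_wallis_h(toy_groups):.2f}")  # prints H = 7.20
```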

Figure 6

Variations in aa mutations across the S protein sequences of SARS‐CoV‐2 according to geographic region (continent) (a–f) and climatic condition (g–k). The top 45 high‐frequency mutations in each domain (continent or climatic zone) were considered for Lolliplot visualization. Mutations at position 614 occurred in 1,481, 377, 743, 16,842, 6,880 and 428 S protein sequences of Asia, Africa, Australia, Europe, North America and South America, respectively, and in 959, 8,654, 156, 16,531 and 435 sequences in continental, diverse, dry, temperate and tropical conditions, respectively. We normalized the graphs, keeping position 614 as the highest mutational frequency in each graph.

The genomic variability of SARS‐CoV‐2 strains, manifested by mutations in the spike protein and scattered across the globe, may underlie geographically specific aetiological effects. One important application of mapping mutations is the development of antiviral therapies targeting specific regions, for example the spike region of the SARS‐CoV‐2 genome (Callaway, 2020). Our current findings are consistent with the study by Deshwal (2020), who reported the highest SARS‐CoV‐2 infection and case fatality rates in European countries. In another study, Pachetti et al. (2020) reported two non‐synonymous mutations (R203K and L3606F) that were shared across ORFs of the SARS‐CoV‐2 genomes of six continents, and recurrent mutations were also common in different countries along with unique mutations. Nevertheless, mutations in the structural proteins of SARS‐CoV‐2, especially in the spike protein, appear to be shaped by geographic location and have diverged differently, possibly due to the environment, demography and the low fidelity of the viral RNA‐dependent RNA polymerase (Brassey et al., 2020; Pachetti et al., 2020; Su et al., 2016).
In this study, we found case fatality rates of 14.16%, 11.72%, 10.05%, 9.31%, 3.30%, 3.00%, 2.30%, 2.07%, 1.65% and 1.63% in the United Kingdom, Italy, France, Spain, Belgium, Germany, Russia, the Netherlands, Sweden and Turkey, respectively (Data S3). Among the tropical Asian countries, higher mortality rates from SARS‐CoV‐2 infections were estimated in Iran (4.76%), India (4.72%), China (2.56%), Pakistan (1.38%) and Indonesia (1.11%), and the rest of the countries had case fatality rates below 1.0%. Moreover, in the diverse climatic conditions of the American countries or territories (both North and South America), the United States (5.67%) and Brazil (5.14%) had relatively higher mortality rates from the SARS‐CoV‐2 pandemic, and the rest of the countries in these continents had substantially lower rates (<1.0%). Case fatality or mortality rates from SARS‐CoV‐2 infections in the remaining two continents (Africa and Australia) were much lower, with death rates of only 2.19%, 1.40% and 1.26% in South Africa, Australia and Algeria, respectively. The rest of the countries and/or territories of these two continents had mortality rates below 1.0% (Data S3). The predominantly higher mortality rates and unique aa mutations in the S protein sequences of many European temperate countries might be associated with the higher number of SARS‐CoV‐2 genome sequences deposited in global databases such as GISAID during that time compared to other continents. However, the correlation of such higher aa mutation frequency with viral pathogenesis needs to be ascertained. Moreover, it is worth noting that reported disease severity (which may not represent the actual severity) might be affected by several other factors, such as healthcare facilities, the average age and genetic context of the population, and the control strategies adopted by the countries.
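For orientation, a case fatality rate is conventionally the number of reported deaths divided by the number of confirmed cases, expressed as a percentage. The text does not spell out how the figures in Data S3 were derived, so the sketch below only illustrates that standard definition and the 1.0% cut‐off used in the paragraph above. The function names and the country counts are hypothetical placeholders, not values from the study.

```python
def case_fatality_rate(deaths, confirmed_cases):
    """Return the case fatality rate as a percentage of confirmed cases."""
    if confirmed_cases <= 0:
        raise ValueError("confirmed_cases must be positive")
    if deaths < 0 or deaths > confirmed_cases:
        raise ValueError("deaths must lie between 0 and confirmed_cases")
    return 100.0 * deaths / confirmed_cases


def countries_at_or_above(counts, threshold_pct=1.0):
    """Map each country whose CFR reaches threshold_pct to its CFR.

    counts maps a country name to a (deaths, confirmed_cases) tuple.
    """
    rates = {name: case_fatality_rate(d, c) for name, (d, c) in counts.items()}
    return {name: rate for name, rate in rates.items() if rate >= threshold_pct}


# Placeholder counts for two hypothetical countries (not study data).
toy_counts = {"CountryX": (50, 1000), "CountryY": (3, 1000)}
print(countries_at_or_above(toy_counts))
```

Because the denominator is confirmed cases, the rate depends on testing coverage, which is one reason reported severity may differ from actual severity.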
Irrespective of the significance of geography for emerging infectious disease epidemiology, the effects of global mobility on the genetic diversity and molecular evolution of SARS‐CoV‐2 are under‐appreciated and only beginning to be understood. The recent monograph on the spatial epidemiology of COVID‐19 makes no reference to the genetic disparity of SARS‐CoV‐2 (Brassey et al., 2020; Harvey, 2020; Pachetti et al., 2020; Su et al., 2020).

Mutational comparison of the S proteins of SARS‐CoV‐2, SARS‐CoV and BatCoV

We compared the S protein mutations of SARS‐CoV‐2 with the SARS‐CoV reference strain (NCBI accession no. NC_004718) and the bat coronavirus RaTG13 strain (NCBI accession no. MN996532). The identity, similarity and gap percentages of the S protein between the Wuhan strain of SARS‐CoV‐2 and RaTG13 were 97.3%, 98.3% and 0.4%, respectively, and those between the Wuhan strain of SARS‐CoV‐2 and SARS‐CoV were 76.2%, 86.9% and 2.1%, respectively (Table S1). These findings are in line with previously published reports (Swatantra Kumar et al., 2020; Tang et al., 2020; Wrapp et al., 2020). We found mutations in the variable regions between SARS‐CoV‐2 and RaTG13, and the recurrent mutations (S50L, T76I, A372T, N439K) appear to convert the SARS‐CoV‐2 residues to those of RaTG13 (Table S1). Furthermore, we found 45 mutation sites in the variable regions between SARS‐CoV‐2 and SARS‐CoV at which substitutions converted the SARS‐CoV‐2 residue to that of SARS‐CoV (Figure 7).
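Identity and gap percentages such as those above are computed column by column over a pairwise alignment, and a substitution "converts" towards a related virus when the new residue equals the residue that virus carries at the aligned position. The sketch below illustrates both ideas and is not the authors' code. It assumes the two sequences are already aligned with `-` as the gap character, that both percentages use the alignment length as the denominator, and that substitutions are written as reference aa, reference position, new aa (for example S50L). Similarity is omitted because it requires a substitution matrix that this section does not specify. Function names and the toy sequences are hypothetical.

```python
import re


def alignment_stats(ref_aln, other_aln):
    """Return identity and gap percentages over the alignment length."""
    if len(ref_aln) != len(other_aln) or not ref_aln:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    columns = list(zip(ref_aln, other_aln))
    identical = sum(a == b and a != "-" for a, b in columns)
    gapped = sum(a == "-" or b == "-" for a, b in columns)
    length = len(columns)
    return {
        "identity_pct": 100.0 * identical / length,
        "gap_pct": 100.0 * gapped / length,
    }


def converging_substitutions(ref_aln, other_aln, substitutions):
    """Return substitutions whose new residue matches the other sequence.

    Positions in the substitution labels are 1-based and count only the
    ungapped residues of the reference sequence.
    """
    # Map each reference position to the residue aligned to it.
    other_at = {}
    position = 0
    for ref_res, other_res in zip(ref_aln, other_aln):
        if ref_res != "-":
            position += 1
            other_at[position] = other_res

    hits = []
    for label in substitutions:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
        if match is None:
            raise ValueError(f"unrecognised substitution label: {label}")
        new_residue = match.group(3)
        if other_at.get(int(match.group(2))) == new_residue:
            hits.append(label)
    return hits


# Toy alignment (not real spike sequence).
ref = "MFVF-LVL"
other = "MFIFALV-"
print(alignment_stats(ref, other))
print(converging_substitutions(ref, other, ["V3I", "F4L", "L7A"]))
```

Counting reference positions only over ungapped reference columns keeps the labels in reference coordinates even when the other sequence carries insertions.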

Figure 7

Lolliplot mapping of mutational conversion from SARS‐CoV‐2 to SARS‐CoV with their frequency. We identified 45 sites in the SARS‐CoV‐2 S protein with substitutions resulting in aa homogeneity with the S protein of SARS‐CoV.

The RaTG13 genome possessed a deletion site (681–684 aa) with respect to the SARS‐CoV‐2 genome, and we also found deletions at a very close position (675–679 aa) in two strains of SARS‐CoV‐2 (Table 2). SARS‐CoV also possessed deletions with respect to the Wuhan reference strain of SARS‐CoV‐2 at aa positions 72–78, 144–147, 243–247, 256–257 and 679–682. In this study, we also found deletions at aa positions 61–76, 138–145, 241–244 and 675–679 in different strains of SARS‐CoV‐2. These deletions therefore suggest that different strains of SARS‐CoV‐2 are acquiring traits of SARS‐CoV. Moreover, a recent study reported that the S1 protein of the Pangolin‐CoV is much more closely related to SARS‐CoV‐2 than to RaTG13 (Uddin et al., 2020). However, this phenomenon of evolving and/or recurrent mutations should be interpreted using a larger dataset from different host populations and geo‐climatic conditions.

CONCLUSIONS

Our findings on non‐synonymous mutations in the spike protein of SARS‐CoV‐2 genomes suggest that the virus is continuously evolving. European, North American and Asian strains might coexist, each characterized by a different mutation pattern. Moreover, the geo‐climatic distribution of the recurrent mutations in the spike suggests a plausible link to higher mutation rates and disease severity in the European temperate countries. However, the geo‐climatic effects of the observed mutations in the spike protein of SARS‐CoV‐2 on the properties of the diverse strain variants are yet to be evaluated in clinical or experimental studies. These results therefore need to be interpreted cautiously, given the existing uncertainty about SARS‐CoV‐2 genomic data, when developing potential prophylaxis and mitigation for tackling the COVID‐19 pandemic crisis. The fast and accurate pipeline presented here makes it easy to investigate synonymous/non‐synonymous mutations, mutation frequency and deletions in large datasets in the shortest possible time, without in‐depth bioinformatics knowledge.

CONFLICTS OF INTEREST

The authors of this manuscript declare that they have no conflict of interest.

AUTHOR CONTRIBUTIONS

MSR, MRI, MNH, ASMRUA, MA, JA and SA conducted the overall study. MSR, MRI and MNH drafted the manuscript. MNH compiled the final manuscript. AA, MS and MAH contributed intellectually to the interpretation and presentation of the results.

ETHICAL APPROVAL

We confirm that the ethical policies of the journal, as noted on the journal's author guidelines page, have been adhered to. No ethical approval was required since the study did not include any animal or human sample.

Supporting information: Table S1, Data S1, Data S2 and Data S3.

43 in total

1. Mutation rates among RNA viruses.

Authors: J W Drake; J J Holland
Journal: Proc Natl Acad Sci U S A Date: 1999-11-23 Impact factor: 11.205

2. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.

Authors: Kazutaka Katoh; Kazuharu Misawa; Kei-ichi Kuma; Takashi Miyata
Journal: Nucleic Acids Res Date: 2002-07-15 Impact factor: 16.971

3. MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets.

Authors: Sudhir Kumar; Glen Stecher; Koichiro Tamura
Journal: Mol Biol Evol Date: 2016-03-22 Impact factor: 16.240

4. Coronavirus vaccines: five key questions as trials begin.

Authors: Ewen Callaway
Journal: Nature Date: 2020-03 Impact factor: 49.962

5. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation.

Authors: Wei Shen; Shuai Le; Yan Li; Fuquan Hu
Journal: PLoS One Date: 2016-10-05 Impact factor: 3.240

6. Attenuated SARS-CoV-2 variants with deletions at the S1/S2 junction.

Authors: Siu-Ying Lau; Pui Wang; Bobo Wing-Yee Mok; Anna Jinxia Zhang; Hin Chu; Andrew Chak-Yiu Lee; Shaofeng Deng; Pin Chen; Kwok-Hung Chan; Wenjun Song; Zhiwei Chen; Kelvin Kai-Wang To; Jasper Fuk-Woo Chan; Kwok-Yung Yuen; Honglin Chen
Journal: Emerg Microbes Infect Date: 2020-12 Impact factor: 7.163

7. Why are RNA virus mutation rates so damn high?

Authors: Siobain Duffy
Journal: PLoS Biol Date: 2018-08-13 Impact factor: 8.029

8. Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period.

Authors: Stephen M Kissler; Christine Tedijanto; Yonatan H Grad; Marc Lipsitch; Edward Goldstein
Journal: Science Date: 2020-04-14 Impact factor: 47.728

9. Molecular characterization of SARS-CoV-2 in the first COVID-19 cluster in France reveals an amino acid deletion in nsp2 (Asp268del).

Authors: A Bal; G Destras; A Gaymard; M Bouscambert-Duchamp; M Valette; V Escuret; E Frobert; G Billaud; S Trouillet-Assant; V Cheynet; K Brengel-Pesce; F Morfin; B Lina; L Josset
Journal: Clin Microbiol Infect Date: 2020-03-28 Impact factor: 8.067

10. Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein.

Authors: Alexandra C Walls; Young-Jun Park; M Alejandra Tortorici; Abigail Wall; Andrew T McGuire; David Veesler
Journal: Cell Date: 2020-03-09 Impact factor: 41.582
