Literature DB >> 33781798

Genome-wide analysis of 10664 SARS-CoV-2 genomes to identify virus strains in 73 countries based on single nucleotide polymorphism.

Nimisha Ghosh1, Indrajit Saha2, Nikhil Sharma3, Suman Nandi4, Dariusz Plewczynski5.   

Abstract

Since the onslaught of SARS-CoV-2, the research community has been searching for a vaccine to fight against this virus. However, during this period, the virus has mutated to adapt to the different environmental conditions in the world and made the task of vaccine design more challenging. In this situation, the identification of virus strains is very much timely and important task. We have performed genome-wide analysis of 10664 SARS-CoV-2 genomes of 73 countries to identify and prepare a Single Nucleotide Polymorphism (SNP) dataset of SARS-CoV-2. Thereafter, with the use of this SNP data, the advantage of hierarchical clustering is taken care of in such a way so that Average Linkage and Complete Linkage with Jaccard and Hamming distance functions are applied separately in order to identify the virus strains as clusters present in the SNP data. In this regard, the consensus of both the clustering results are also considered while Silhouette index is used as a cluster validity index to measure the goodness of the clusters as well to determine the number of clusters or virus strains. As a result, we have identified five major clusters or virus strains present worldwide. Apart from quantitative measures, these clusters are also visualized using Visual Assessment of Tendency (VAT) plot. The evolution of these clusters are also shown. Furthermore, top 10 signature SNPs are identified in each cluster and the non-synonymous signature SNPs are visualised in the respective protein structures. Also, the sequence and structural homology-based prediction along with the protein structural stability of these non-synonymous signature SNPs are reported in order to judge the characteristics of the identified clusters. As a consequence, T85I, Q57H and R203M in NSP2, ORF3a and Nucleocapsid respectively are found to be responsible for Cluster 1 as they are damaging and unstable non-synonymous signature SNPs. Similarly, F506L and S507C in Exon are responsible for both Clusters 3 and 4 while Clusters 2 and 5 do not exhibit such behaviour due to the absence of any non-synonymous signature SNPs. In addition to all these, the code, SNP dataset, 10664 labelled SARS-CoV-2 strains and additional results as supplementary are provided through our website for further use.
Copyright © 2021 Elsevier B.V. All rights reserved.

Entities:  

Keywords:  COVID-19; Clustering; Multiple sequence alignment; Non-synonymous SNP; SARS-CoV-2

Year:  2021        PMID: 33781798      PMCID: PMC7997709          DOI: 10.1016/j.virusres.2021.198401

Source DB:  PubMed          Journal:  Virus Res        ISSN: 0168-1702            Impact factor:   3.303


Introduction

SARS-CoV-2 is the causal agent for current ongoing outbreak of disease commonly known as COVID-19 (Zhou et al., 2020) which has proven to have a detrimental effect on the humankind. As a result, medical emergencies have surged and a halt of economic growth has occurred around the globe due to an eccentric impact of SARS-CoV-2. SARS-CoV-2 belongs to the family of Coronaviridae which also houses SARS-CoV-1 and MERS-CoV (van Dorp et al., 2020). First case of Severe Acute Respiratory Syndrome (SARS) was registered way back in 2002-03, which took around 8000 lives.2 Another pathogenic invasion was reported in 2012, named as Middle East Respiratory Syndrome Coronavirus (MERS-CoV) with a worldwide mortality rate of 35.5% (Ahmed, 2017). However, these two viruses have a significantly low transmission rate among the masses as compared to SARS-CoV-2 (Petersen et al., 2020). On the other hand, SARS-CoV-2 which emerged in late 2019 in China has now spread across the globe. On 11th March 2020, World Health Organization (WHO) declared COVID-19 as a global pandemic due to its high transmission rate and adverse effect on health care systems resulting in 112.2 million affected and 2485k deaths till now (Worldometer, 2021). In this current havoc situation, many organizations such as The University of Oxford (Folegatti et al., 2020), Bharat BioTech, Beijing Institute of Biological Products are witnessing their ongoing trials of designed vaccines respectively to curb the impact of this contagious virus. In fact, in India indigenously produced Covaxin by Bharat BioTech and Covishield (the local name for the Oxford-AstraZeneca vaccine developed in the UK) are already being disseminated among the masses. Vaccine development is a key component in the prevention of disease spread and reduction in the morbidity and mortality associated with the diseases by evoking an immune response in form of antigens against regions of proteins which are critical for pathogen binding (Amela et al., 2007). Since the emergence of SARS-CoV-2 in December 2019, many mutations such as substitutions and deletions in its coding and non-coding regions have been discovered (Phan, 2020) till date. Genome-wide analysis of 566 Indian SARS-CoV-2 genomes (Saha et al., 2020) revealed numerous mutation points as substitutions, deletions, insertions and single nucleotide polymorphism popularly known as SNP. Among all types of mutations, SNP (Nogales and Dieg, 2019, Yin, 2020) can be considered to be the source of variance in virus genomes giving a way for re-emergence, drug resistance and antibody escape for pathogens. SNPs were used in (Yang et al., 2020) to classify four super spreader (SS) clusters according to relative variants. According to their study, SS1 was widely spread in Asia and US, while SS4 was responsible for the pandemic in Europe. Hence, SNPs play a vital role in tracking virus strains and variations. Analysis of SNP was carried out in (Lo et al., 2018) as well where they suggested a change in the samples of Plasmodium falciparum between the northern and southern regions of Western Kenya; the samples from the southern part showed lesser divergence from each other. Furthermore, clustering has been used by many studies (Sevilla-Reyes et al., 2013, Fischer et al., 2018, Hahn et al., 2020) to understand genetic characteristics for a large set of genomes. To detect genetic diversity of non-structural (NS1) protein of influenza A virus, (Sevilla-Reyes et al., 2013) used clustering on the available protein sequence data and combined it with maximum likelihood phylogenetic RNA reconstruction and consensus WebLogo comparison. In (Fischer et al., 2018), Fischer et al. have used affinity propagation clustering to define clusters for rabies virus (RABV). Most recently, unsupervised cluster analysis of SARS-CoV-2 genomes have been carried out in (Hahn et al., 2020) by using principal component analysis to a similarity matrix to compare all pairs of 2540 nucleotides using Jaccard index. The analysed results were used to illustrate the geographic progression of the virus. Apart from this, for clustering of SARS-CoV-2 genomes, online web servers like GISAID CoVsurver3 and Pangolin4 exist. However, these servers take a substantial amount of time while performing clustering in order to generate phylogenetic tree as they consider all substitutions like mutation. This drawback has motivated us to perform the underlying clustering task faster and accurately on smaller and relevant features like SNP in order to identify SARS-CoV-2 virus strains of 10664 SARS-CoV-2 genomes. To address the clustering task, we have performed genome-wide analysis of 10664 SARS-CoV-2 genomes from 73 countries to determine the SNPs. SNPs represent mutation as substitution that occurs in more than 1% of the virus population for a given genomic position. In this regard, 107 SNPs are identified throughout the genome which are used to prepare a binary dataset. In order to identify the virus strains from this binary dataset, careful application of clustering methods with proper distance functions is very crucial. Thus, from our previous clustering experience, the use of hierarchical clustering such as Average Linkage and Complete Linkage with Jaccard and Hamming distances are appropriate in this context, while for computing the goodness of the clustering, the Silhouette index can be used with the same distance functions. Moreover, the consensus of the clustering results can be an added boost for the identification of the proper number of clusters as virus strains. To the best of our knowledge, this approach has not yet been applied for the identification of SARS-CoV-2 virus strains. This approach results in five major clusters as virus strains. Moreover, their presence in different countries are identified along with the evolution of the virus genomes from January to July 2020 for each of the 73 countries. These outcomes are shown quantitatively and visually through BioCircos (Cui et al., 2016), Visual Assessment of Tendency (VAT) plot (Bezdek and Hathaway, 2002, Kumar and Bezdek, 2020) and Heatmap (Deng et al., 2014). Therefore, the major contributions of this work can be summarised as: SNP identification from 10664 SARS-CoV-2 genomes of 73 countries, binary dataset creation from SNP data to find the number of clusters as virus strains present in 73 countries around the globe and their evolution, identifying signature SNPs in each cluster and determining the structural stability of the non-synonymous signature SNPs to judge the characteristics of the identified clusters.

Materials and methods

In this section, collection of SARS-CoV-2 genomes, Visual Assessment of Tendency (VAT) plot and the proposed pipeline with the preparation of SNPs data are discussed. For the ease of understanding for the readers, hierarchical clustering (Tou and Gonzalez, 1974, Devijver and Kittler, 1982) such as Average Linkage (Tou and Gonzalez, 1974, Devijver and Kittler, 1982) and Complete Linkage (Tou and Gonzalez, 1974) with Jaccard (Tou and Gonzalez, 1974, Devijver and Kittler, 1982) and Hamming distance (Tou and Gonzalez, 1974, Devijver and Kittler, 1982) functions, Silhouette index (Rousseeuw, 1987) as cluster validity measure are briefly discussed in supplementary.

Collection of SARS-CoV-2 genomes

Initially, 10664 complete and near complete SARS-CoV-2 genomes were collected from Global Initiative on Sharing All Influenza Data (GISAID)5 in fasta format while the Reference Genome (NC_045512.2)6 was collected from National Center for Biotechnology Information (NCBI). These genomic sequences are distributed in 73 countries starting from January till July 2020. This is important to note that GISAID contains many incomplete sequences or virus genomes which we have removed while preparing our sequence dataset. The dataset contains sequence ID and virus genome as a fasta format. The maximum and average length of the 10664 virus genomes are 29,903 and 29,821 bp respectively. Please note that the maximum length has been considered by taking the reference sequence. These 10664 SARS-CoV-2 sequences are aligned using multiple sequencing alignment (MSA) to prepare the SNP dataset. Please note that for the data visualization and editing BioEdit was used. For the alignment of sequences High Performance Computing facility of NITTTR, Kolkata was used and for the identification and preparation of SNP dataset MATLAB R2019b was used.

Visual Assessment of Tendency

The well-known Visual Assessment of Tendency (VAT) (Bezdek and Hathaway, 2002, Kumar and Bezdek, 2020) representation is used here to visualize the clusters formed by the clustering methods. This technique generally represents pairwise dissimilarity information of n data objects as an n × n image, where the data objects are reordered in such a way so that the resulting image is able to highlight potential cluster structure in the dataset. Therefore, the dataset is sorted first according to the cluster labels obtained after clustering. Subsequently, the distance matrix, e.g., Jaccard or Hamming, is computed for graphical representation. Boxes lying on the main diagonal represent the clusters structure.

Pipeline of the workflow

The pipeline of the workflow is shown in Fig. 1 (a). In order to find the virus strains, it is important to prepare the SNP dataset and then to use that datatset in hierarchical clustering with different distance functions to build a consensus of clustering results so that we can determine robust and deterministic number of clusters as virus strains that are present in the dataset. Initially to prepare the SNP dataset, 10664 SARS-CoV-2 sequences are aligned using multiple sequencing alignment (MSA) technique called Clustal Omega (ClustalO) (Sievers et al., 2011, Sievers and Higgins, 2014) in presence of reference sequence from NCBI. The choice of selecting ClustalO is taken because of its popularity, speed and accuracy. After performing ClustalO, a consensus sequence is built so that the mutation as substitution, so called SNPs that occur in more than 1% of the virus population for a given genomic position i.e. more than 106 viruses can be identified. The detection technique is shown in Fig. 1(b). Once such SNPs are identified, a SNP binary dataset of is prepared based on its presence in the virus sequences as shown in the third figure of Fig. 1(c). Thereafter, such binary dataset is used for hierarchical clustering using Average Linkage and Complete Linkage with the Jaccard and Hamming distance functions separately in iterative manner by considering number of Clusters 2 to 100. In each iteration, Silhouette Index is computed to measure the goodness of the clustering results. Once the number of clusters are determined by each of the four methods, the VAT plots of such clustering results are prepared to see the structure of clusters. Based on the VAT plots, two or more clustering results are compared to create a final consensus clustering solution which gives the robust and stable clusters as virus strains. After finding the clusters as virus strains, the presence of virus strains in different countries is identified as the distribution of the genomic sequences is known. Furthermore, top 10 signature SNPs in each cluster are identified based on their frequency of occurrence in the cluster.
Fig. 1

(a) Pipeline of the Workflow, (b) Detection technique to find mutation as substitution, (c) Bar plot to represent the frequency of SNPs at different genomic positions, BioCircos plot to represent SNPs with corresponding coding regions and example of SNP dataset as binary matrix 1 and 0 represents presence and absence of SNP in any specific sequence.

(a) Pipeline of the Workflow, (b) Detection technique to find mutation as substitution, (c) Bar plot to represent the frequency of SNPs at different genomic positions, BioCircos plot to represent SNPs with corresponding coding regions and example of SNP dataset as binary matrix 1 and 0 represents presence and absence of SNP in any specific sequence.

Results

The results of the experiments are explained here. Initially, 107 SNPs are identified in both coding and non-coding regions from the 10664 SARS-CoV-2 sequences. SNPs represent mutation as substitution that occurs in more than 1% of the virus population for a given genomic position. Such SNPs are shown using bar and BioCircos plots in the first and second figures of Fig. 1(c). In bar plot, the frequency of SNPs with their genomic locations are shown while the BioCircos plot represents the SNPs with the corresponding coding regions. In the first figure of Fig. 1(c), SNPs at coordinates 241, 3037, 14,408, 23,403 and 29,816 occur in more than 30% of the virus population, that is in more than 3199 genomes out of 10664. The first coordinate belongs to 5′-UTR, the next two are in ORF1ab, 23,403 belongs to Spike, while the last coordinate is in 3′-UTR. Thereafter, by considering the presence or absence of a SNP in the 10664 sequences, the corresponding binary dataset is created of size 10664 × 107 as shown in the third figure of Fig. 1(c). For each of the virus sequence, 1 represents the presence of SNP and 0 represents the absence of SNP at a particular genomic coordinate for such 107 identified SNPs. Once the SNP binary dataset is prepared, hierarchical clustering such as Average Linkage and Complete Linkage with Jaccard and Hamming distance functions are considered iteratively for number of clusters ranging from 2 to 100 to find the number of clusters as virus strains. These results are presented in Table 1 and Fig. 2 (a). It is evident from Fig. 2(a) that the silhouette values are highest at cluster numbers 8, 7, 5 and 3 respectively for Average Linkage with Jaccard distance, Average Linkage with Hamming distance, Complete Linkage with Jaccard distance and Complete Linkage with Hamming distance. The silhouette values for these clusters are reported in Table 1.
Table 1

Optimal number of clusters produced by hierarchical clustering methods on SNPs data.

MethodDistance functionCluster validity indexNumber of optimal clustersSilhouette value
Average LinkageJaccardSilhouette Index80.5163
Hamming70.5130
Complete LinkageJaccard50.5317
Hamming30.5103
Fig. 2

(a) The plots of silhouette values for Average Linkage and Complete Linkage with Jaccard and Hamming distances for number of clusters ranging from 2 to 100, (b) The VAT plots of the clusters produced by Average Linkage and Complete Linkage with Jaccard and Hamming distances for higher silhouette values, (c) The confusion matrices for comparing clustering results of Average Linkage with Jaccard and Hamming distances as reported in Table 2 and Average Linkage and Complete Linkage with Jaccard distance as reported in Table 3.

Optimal number of clusters produced by hierarchical clustering methods on SNPs data. (a) The plots of silhouette values for Average Linkage and Complete Linkage with Jaccard and Hamming distances for number of clusters ranging from 2 to 100, (b) The VAT plots of the clusters produced by Average Linkage and Complete Linkage with Jaccard and Hamming distances for higher silhouette values, (c) The confusion matrices for comparing clustering results of Average Linkage with Jaccard and Hamming distances as reported in Table 2 and Average Linkage and Complete Linkage with Jaccard distance as reported in Table 3.
Table 2

Mapping of SARS-CoV-2 genomes to the different clusters produced by hierarchical clustering methods.

MethodDistance functionCluster 1Cluster 2Cluster 3Cluster 4Cluster 5Cluster 6Cluster 7Cluster 8
Average LinkageJaccard90584662249728143351
Hamming9044497265012633321



Complete LinkageJaccard8898444407545360
Hamming9488766410
Table 3

Re-evaluation of five major clusters produced by Average Linkage and Complete Linkage clustering with Jaccard distance using silhouette value.

MethodDistance functionCluster validity indexNumber of clustersSilhouette value
Average LinkageJaccardSilhouette Index50.5581
Complete Linkage50.5344
Moreover, the mapping of the SARS-CoV-2 genomes to each cluster is reported in Table 2 . From Table 1, it can be seen that for Average Linkage with Jaccard and Hamming distance respectively, the silhouette values are 0.5163 and 0.5130 for Clusters 8 and 7. Similarly, for Complete Linkage, the silhouette values are 0.5317 and 0.5103 for Clusters 5 and 3 respectively. For both Average Linkage and Complete Linkage, Jaccard distance shows the higher silhouette value for different number of clusters. Therefore, to take a decision about the number of clusters, it is important to see the size and the structure of the clusters. The size of the clusters is shown in Table 2 by mapping the SARS-CoV-2 genomes to different clusters, while the structure of the clusters is visualised using VAT plots in Fig. 2(b). Mapping of SARS-CoV-2 genomes to the different clusters produced by hierarchical clustering methods. From the results of Table 2 and Fig. 2(b), it is found that for Average Linkage with Jaccard distance, Clusters 3, 6 and 8 have very small number of SARS-CoV-2 genomes as compared to the other clusters. Similarly, for Hamming Distance, Clusters 3 and 7 are also found to be very small in size. On the other hand, for Complete Linkage with Jaccard and Hamming Distances we have 5 and 3 clusters respectively where all the clusters have reasonable number of SARS-CoV-2 genomes. The clustering results of Average Linkage with Jaccard and Hamming distances are compared using confusion matrix in the first figure of Fig. 2(c) in order to build the consensus between them to identify the common number of SARS-CoV-2 genomes in the different clusters. As a result, five major clusters are identified of size 9026, 465, 497, 263 and 332. Thereafter, these results are subsequently compared once again in order to build the consensus between the clustering results of Complete Linkage with Jaccard distance using confusion matrix in the second figure of Fig. 2(c). It also shows five major clusters with 8887, 444, 492, 263 and 323 SARS-CoV-2 genomes. Furthermore, to evaluate the five clusters obtained from Average Linkage and Complete Linkage with Jaccard distances, silhouette values are computed and reported in Table 3 . Re-evaluation of five major clusters produced by Average Linkage and Complete Linkage clustering with Jaccard distance using silhouette value. The results show that the five clusters obtained by applying Average Linkage have a higher silhouette value, 0.5581, in comparison to all the other cases. Thus, after analysis of SNPs data of 10664 SARS-CoV-2 genomes of 73 countries, we can conclude five virus strains are present. After finding these five clusters as virus strains, SARS-CoV-2 genomes in 73 countries are mapped to the five clusters with the corresponding percentage which are reported in Table 4 . The same is also visualised through Circos plot in Fig. 3 . All the detailed clustering results are reported in Supplementary Table S1.
Table 4

Mapping of SARS-CoV-2 genomes to the five clusters and the corresponding percentage.

CountryTotal number of genomesCluster 1
Cluster 2
Cluster 3
Cluster 4
Cluster 5
Number of genome(%)Number of genome(%)Number of genome(%)Number of genome(%)Number of genome(%)
USA2546223687.8270.27642.51762.991636.40
England1592118874.6218711.751157.2290.57935.84
China63158592.7110.16111.74304.7540.63
Australia58234559.2818631.96294.98142.4181.37
Netherlands56856599.470020.350010.18
India56649587.4600203.53488.4830.53
Iceland46238182.4710.227416.0261.3000
Scotland43440693.5500286.450000
Belgium42642198.830051.170000
Portugal34934297.990072.010000
Spain26720677.15248.99155.6200228.24
Wales21415873.833315.42157.0120.9362.80
Sweden19419410000000000
France18918396.8321.0621.0610.5310.53
New Zealand17516091.4300158.570000
Switzerland16410966.4600169.763823.1710.61
Denmark10910192.660076.420010.92
Japan948994.680022.130033.19
Brazil818098.770011.230000
Canada726793.0611.3945.560000
Luxembourg716084.510079.8611.4134.23
Germany686798.530011.470000
Itay665583.3311.52710.6134.5500
Kazakhstan492653.0612.04002244.9000
Oman424197.62000012.3800
Poland393384.6200410.260025.13
South Korea363610000000000
Vietnam312890.3226.450013.2300
Singapore281864.290027.1413.57725
Thailand282382.1400310.7127.1400
Russia272696.300013.700000
Finland261869.2300623.080027.69
Czech Republic252510000000000
Mexico211990.480014.7614.7600
Norway201470005251500
Northern Ireland19842.11842.11210.5315.2600
Estonia181810000000000
Austria161487.5000212.500000
Chile151493.330016.670000
DRC1512800016.6700213.33
Colombia141178.5700214.2917.1400
Senegal191263.1615.260000631.58
Croatia121083.330000216.6700
Georgia11545.45000000654.55
Kenya111110000000000
Malaysia11872.7300327.270000
Romania111110000000000
South Africa11763.6400436.360000
Ireland101010000000000
Latvia101010000000000
Nigeria8810000000000
Kuwait7571.4300114.29114.2900
Turkey5480000000120
Bangladesh4410000000000
Greece4410000000000
Qatar4375000012500
Slovakia4410000000000
Algeria3310000000000
Argentina3310000000000
Belarus3310000000000
Hungary3310000000000
Saudi Arabia4000012500375
Indonesia2150001500000
Israel2210000000000
Pakistan2210000000000
Serbia2210000000000
Slovenia2210000000000
Cambodia1110000000000
Lithuania1110000000000
Morocco1110000000000
Nepal1110000000000
Panama1110000000000
Peru1110000000000
Fig. 3

Circos plot to visualise the mapping of SARS-CoV-2 genomes to five clusters.

Mapping of SARS-CoV-2 genomes to the five clusters and the corresponding percentage. Circos plot to visualise the mapping of SARS-CoV-2 genomes to five clusters. Furthermore, top 10 signature SNPs from each cluster are identified and reported in Table 5 . In unsupervised learning (clustering), selection of features which may define a cluster is a non-trivial task. In this work, this selection has been performed by considering the frequency of a SNP in a cluster. For example, the frequency of occurrence of mutation point 241 is 6088 and thus considered to be the top most signature SNP in Cluster 1. To depict the common signature in the five clusters, visualisation in the form of Venn diagram is shown in Fig. 4 . As can be seen from the figure, there are no common SNPs in all the five clusters, thereby confirming the fact that such signature SNPs are features which indeed define the clusters. It is worth mentioning here that multiple changes in nucleotide may lead to multiple amino acid changes as well. For example in Table 5, at mutation point 14,408 there are two nucleotide changes, C>T and C>A. As a consequence, there are two changes in amino acid as well, P323L and P323H. Thus, for 50 signature SNPs in 5 clusters, the total number of non-synonymous signature SNPs are 20, out of which 16 are unique.
Table 5

Top 10 signature SNPs in each cluster.

ClusterNumber of sequences in each clusterCoordinate of signature SNPsOccurrence of signature SNPs in genomeChange in nucleotideChange in amino acidCoordinate of amino acid in proteinMapped with coding and Non-coding region
Cluster 188872416088C>TNANA5′-UTR
10591735C>TT>I85ORF1ab
30376071C>TSynonymous106ORF1ab
14,4086046(C>T)(C>A)(P>L), (P>H)323ORF1ab
23,4036073A>GD>G614Spike
25,5632232(G>T)(G>C)Q>H57ORF3a
28,8811855(G>A)(G>T)(R>K) (R>M)203Nucleocapsid
28,8821848(G>A)(G>T)Synonymous, (R>S)203Nucleocapsid
28,8831847G>CG>R204Nucleocapsid
29,8162613(T>A)(T>G)NANA3′-UTR



Cluster 244429,816416(T>A)(T>G)NANA3′-UTR
29,857348(C>A)(C>T)(C>G)NANA3′-UTR
29,858369(T>A)(T>C)(T>G)NANA3′-UTR
29,859402(T>A)(T>G)(T>C)NANA3′-UTR
29,861416(G>A)(G>C)(G>T)NANA3′-UTR
29,862427(G>C)(G>A)(G>T)NANA3′-UTR
29,864435(G>A)(G>C)(G>T)NANA3′-UTR
29,867437(T>A)(T>G)(T>C)NANA3′-UTR
29,868435(G>A)(G>T)(G>C)NANA3′-UTR
29,870433(C>A)(C>G)(C>T)NANA3′-UTR



Cluster 349219,557475(T>A)(T>C)(T>G)(F>L), Synonymous, (F>L)506ORF1ab
19,558479(A>G)(A>C)(A>T)(S>G) (S>R) (S>C)507ORF1ab
22,506469(C>A)(C>T)(C>G)(T>N) (T>I) (T>S)315Spike
29,776486(A>G)(A>T)NANA3′-UTR
29,779489(G>A)(G>T)NANA3′-UTR
29,780487(A>G)(A>C)NANA3′-UTR
29,781492(G>A)(G>T)(G>C)NANA3′-UTR
29,782487(A>G)(A>C)NANA3′-UTR
29,783490(G>C)(G>T)(G>A)NANA3′-UTR
29,784483(C>T)(C>A)(C>G)NANA3′-UTR



Cluster 426319,557263(T>A)(T>C)(T>G)(F>L), Synonymous, (F>L)506ORF1ab
19,558263(A>G)(A>C)(A>T)(S>G) (S>R) (S>C)507ORF1ab
29,858259(T>A)(T>C)(T>G)NANA3′-UTR
29,859262(T>A)(T>G)(T>C)NANA3′-UTR
29,860248(A>G)(A>C)(A>T)NANA3′-UTR
29,861263(G>A)(G>C)(G>T)NANA3′-UTR
29,862263(G>C)(G>A)(G>T)NANA3′-UTR
29,863245(A>C)(A>T)(A>G)NANA3′-UTR
29,864261(G>A)(G>C)(G>T)NANA3′-UTR
29,867249(T>A)(T>G)(T>C)NANA3′-UTR



Cluster 53233302(T>G)(T>A)(T>C)NANA5′-UTR
4306(A>G)(A>C)(A>T)NANA5′-UTR
5313(A>T)(A>G)(A>C)NANA5′-UTR
6314(A>T)(A>C)(A>G)NANA5′-UTR
7320(G>T)(G>A)(G>C)NANA5′-UTR
8322(G>A)(G>C)(G>T)NANA5′-UTR
10321(T>A)(T>C)(T>G)NANA5′-UTR
11319(T>C)(T>G)(T>A)NANA5′-UTR
12320(A>C)(A>T)(A>G)NANA5′-UTR
29,816320(T>A)(T>G)NANA3′-UTR
Fig. 4

Venn Digram to represent the common signature SNPs in five clusters.

Top 10 signature SNPs in each cluster. Venn Digram to represent the common signature SNPs in five clusters.

Discussions

SARS-CoV-2 has turned out to be a worldwide pandemic and has caused disruptions of epic proportions in human lives. In this situation, the identification of virus strains is a very important task. In this regard, we have analysed 10664 SARS-CoV-2 genomes of 73 countries around the world which resulted in some major outcomes: SNP identification from 10664 SARS-CoV-2 genomes of 73 countries, binary dataset creation from SNP data to find the number of clusters as virus strains present in the 73 countries, identification of top 10 signature SNPs for each cluster and the determination of the structural stability of the non-synonymous SNPs to judge the characteristics of the identified clusters. Initially, we have identified 107 SNPs which are then used to prepare the binary dataset to identify the presence or absence of SNPs. Hierarchical clustering such as Average Linkage and Complete Linkage with Jaccard and Hamming distance functions are then applied on this dataset to the number of clusters as virus strains. These clusters can be used to identify and design peptide based synthetic vaccines viz. epitopes (Ghosh et al., 2021). Also, top 10 signature SNPs from each cluster are identified based on their frequency, the details of which are mentioned in the Results section. Sequence and structural homology-based prediction of the non-synonymous signature SNPs along with their protein stability are reported in Table 6 using tools like PROVEAN (Protein Variation Effect Analyser) (Choi and Chan, 2015), PolyPhen-2 (Polymorphism Phenotyping) (Adzhubei et al., 2010) and I-Mutant 2.0 (Capriotti et al., 2005) to judge the characteristics of the identified clusters. PROVEAN7 works on sequence based prediction algorithm while the prediction of Polyphen-28 is based on sequence, structural and phylogenetic information pertaining to a SNP. On the other hand, I-Mutant 2.09 uses support vector machine (SVM) for the automatic prediction of protein stability changes upon single point mutations. PROVEAN and PolyPhen-2 are used to find the deleterious or damaging non-synonymous SNPs. The threshold value of PROVEAN is set at −2.5. If the PROVEAN score of a SNP is equal to or below this threshold, the corresponding non-synonymous mutation is considered to be deleterious. For Polyphen-2, this range is between 0 to 1. If the score is closer to 1, mutations are more confidently considered to be damaging. From the consensus of both PROVEAN and PolyPhen-2, it can be seen from Table 6 that out of 16 unique non-synonymous signature SNPs, 5 are predicted to be deleterious or damaging, out of which T85I, Q57H and R203M are in NSP2, ORF3a and Nucleocapsid respectively for Cluster 1. For both Clusters 3 and 4, such damaging non-synonymous signature SNPs are F506L and S507C in Exon while Clusters 2 and 3 do not exhibit any such behaviour due to the absence of non-synonymous signature SNPs. All of them are marked in bold in Table 6. Furthermore, protein stability is important to determine the functional and structural activity of a protein. Protein stability dictates the conformational structure of the protein, thereby determining its function. Any change in protein stability may cause misfolding, degradation or aberrant conglomeration of proteins (Hossain et al., 2020). The protein stabilities of the non-synonymous signature SNPs are determined using I-Mutant 2.0. The changes in the protein stability in I-Mutant 2.0 tool is predicted using reliability index (RI) and free energy change values (DDG). The outcome of I-Mutant 2.0 revealed that all of the deleterious 5 unique non-synonymous signature SNPs decrease the stability of the protein structure. These 5 SNPs are shown in their respective protein structures in Fig. 5 . The rest of the structures are provided in the Supplementary Figures S1 and S2. All these structures are taken from Zhanglab10 in the form of respective PDB files. It is to be noted that although some mutations are neutral, their stability can decrease like D614G in Spike protein.
Table 6

Sequence and structural homology-based prediction of non-synonymous signature SNPs along with their protein structural stability.

ClusterChange in amino acidCoded proteinPROVEAN
PolyPhen-2
I-Mutant 2.0
PredictionScorePredictionScoreStabilityDDGRI
Cluster 1T85INSP2Deleterious4.09Probably Damaging0.998Decrease1.717
P323LRdRpNeutral−0.865Benign0.005Decrease−0.86
P323HRdRpNeutral−0.865Benign0.005Decrease−2.096
D614GSpikeNeutral0.598Benign0.004Decrease−1.947
Q57HORF3aDeleterious3.286Probably Damaging0.966Decrease1.127
R203KNucleocapsidNeutral−1.604Probably Damaging0.969Decrease−2.265
R203MNucleocapsidDeleterious3.305Probably Damaging0.998Decrease1.524
R203SNucleocapsidNeutral−2.374Probably Damaging0.994Decrease−2.16
G204RNucleocapsidNeutral−1.656Probably Damaging1Decrease07



Cluster 3F506LExonDeleterious5.845Probably Damaging0.98Decrease2.485
S507GExonNeutral−2.337Possibly Damaging0.662Decrease−2.488
S507RExonNeutral−1.411Benign0.015Increase−0.392
S507CExonDeleterious2.759Probably Damaging0.997Decrease1.512
T315NSpikeNeutral−2.206Probably Damaging0.998Decrease−0.212
T315ISpikeNeutral0.365Probably Damaging0.999Decrease−0.214
T315SSpikeNeutral−1.217Probably Damaging0.995Decrease−0.516



Cluster 4F506LExonDeleterious5.845Probably Damaging0.98Decrease2.485
S507GExonNeutral−2.337Possibly Damaging0.662Decrease−2.488
S507RExonNeutral−1.411Benign0.015Increase−0.392
S507CExonDeleterious2.759Probably Damaging0.997Decrease1.512
Fig. 5

Non-synonymous signature SNPs highlighted in the structures of (a) NSP2 (b) ORF3a (c) Nucleocapsid and (d) Exon.

Sequence and structural homology-based prediction of non-synonymous signature SNPs along with their protein structural stability. Non-synonymous signature SNPs highlighted in the structures of (a) NSP2 (b) ORF3a (c) Nucleocapsid and (d) Exon. The temporal evolution of SARS-CoV-2 genomes in the five identified clusters for the 73 countries from the month of January till July 2020 is reported in Table 7 . For example, in the month of January, 548 genomes are present in Cluster 1. However, the number increased to 6264 in the month of March. In April, there were 1510 genomes while in May, June and July, the numbers are 541, 3 and 21 respectively. Furthermore, to understand the evolution of the SARS-CoV-2 genomes from the month of January till July 2020, their presence in different countries are identified by mapping the genomes to the five major clusters for the 73 countries. This is depicted in the form of pie charts in Table 8 . The corresponding colour representation for the five major clusters and the months are shown in Fig. 6 . The evolution of the genomes is quite evident from the pie charts in Table 8. For example, for USA, SARS-CoV-2 genomes mapped to Cluster 1 are mostly in the month of March and almost disappears by the month of May. For Cluster 2, the genomes are not present beyond the month of April. This indicates that the SARS-CoV-2 genomes present in Cluster 1 and Cluster 2 are either not virulent any longer after the months of May and April respectively or through evolution they may have mutated to be a part of some other clusters. Similarly for the other countries as well, one or the other kind of evolutions for the SARS-CoV-2 genomes for the five different clusters are observed.
Table 7

Temporal evolution of SARS-CoV-2 genomes in the five major clusters from January to July for 73 countries.

ClusterJanuaryMarchAprilMayJuneJuly
Cluster 154862641510541321
Cluster 213266719400
Cluster 3182761365705
Cluster 422146712103
Cluster 512185458100
Table 8

Pie charts to represent mapping of SARS-CoV-2 genomes to the five major clusters and the evolution of such genomes from January to July for 73 countries.

Fig. 6

Colours to represent (a) Five major clusters and b) Months from January to July 2020.

Temporal evolution of SARS-CoV-2 genomes in the five major clusters from January to July for 73 countries. Pie charts to represent mapping of SARS-CoV-2 genomes to the five major clusters and the evolution of such genomes from January to July for 73 countries. Colours to represent (a) Five major clusters and b) Months from January to July 2020. In order to see the common clusters among 73 countries, heatmap is created and shown in Fig. 7 . In the heatmap, deep blue indicates lesser number of common clusters and yellow shows more number of common clusters. This heatmap is not symmetric. For example, the virus strains from India belong to Clusters 1, 3, 4 and 5 while that for England is distributed among all the five clusters. Thus, if we consider row wise though India shares 100% common clusters as virus strains with England, when considered column wise England shares 80% common clusters as virus strains with India. In this regard, a hypothesis can be drawn that the same vaccine can be effective in those countries with common clusters or virus strains.
Fig. 7

Heatmap to represent the common clusters as virus strains among 73 countries.

Heatmap to represent the common clusters as virus strains among 73 countries. It is worth mentioning the advantage of the proposed clustering method as compared to the existing phylogenetic analysers like Maximum-Likelihood and Neighbour-Joining Trees in MEGA-X, GISAID CoVsurver and PANGOLIN. The existing analysers use all substitutions like mutation while the proposed clustering method is more robust as it uses only SNP data, thereby providing faster results. Moreover, SNP data is more relevant in our case as we have small sparse data.

Conclusion

In this work, we have analysed 10664 SARS-CoV-2 genomes of 73 countries to identify Single Nucleotide Polymorphisms (SNPs) which comprise of substitution that occurs in more than 1% of the virus population. As a result, 107 SNPs are identified throughout the genome (including coding and non-coding regions) to prepare a binary dataset of SNP. Thereafter, virus strains as clusters are identified from the SNP data. In this regard, hierarchical clustering viz. Average Linkage and Complete Linkage are applied with Jaccard and Hamming distance functions. Additionally, Silhouette Index is used to gauge the goodness of the clusters and also to determine the number of clusters as virus strains with the same distance functions. Jaccard distance gives the better Silhouette value. Thus, the consensus from the two clustering methods using Jaccard distance are used to determine the proper number of clusters which resulted in five major clusters. Moreover, using the presence of the five clusters in the 73 countries, we have also put forth the evolution of the clusters of virus genomes, starting from January till July 2020. Also, top 10 signature SNPs are identified in each cluster and the non-synonymous signature SNPs are visualised in protein structures. The sequence and structural homology-based prediction along with the protein structural stability of these non-synonymous signature SNPs are also determined to judge the characteristics of the identified clusters. Therefore, this work can be summarised as identification of SNPs from 10664 SARS-CoV-2 genomes of 73 countries, creation of binary dataset from SNP data in order to find the number of clusters as virus strains present in 73 countries, identifying signature SNPs in each cluster and determining the structural stability of the non-synonymous signature SNPs to judge the characteristics of the identified clusters.

Ethics approval and consent to participate

The ethical approval or individual consent was not applicable.

Availability of data and materials

The aligned 10664 SARS-CoV-2 genomes with reference and consensus sequences, SNP dataset and supplementary are available at “http://www.nitttrkol.ac.in/indrajit/projects/COVID-Mutation-10K-Clustering/”. Moreover, all the virus genomes used in this work are publicly available at GISAID database.

Consent for publication

Not applicable.

Funding

This work has been partially supported by CRGshort term research grant on COVID-19 (CVD/2020/000991) from Science and Engineering Research Board (, Department of Science and Technology, Govt. of India. This work is also partially funded by IDUB against COVID-19 project granted by Warsaw University of Technology under the program Excellence Initiative: Research University (IDUB). It is co-supported by Polish National Science Centre (2019/35/O/ST6/02484), Foundation for Polish Science co-financed by the European Union under the European Regional Development Fund (TEAM to DP).

Author contributions

Nimisha Ghosh: Conceptualization; Methodology; Data curation; Formal analysis; Software; Validation; Writing – original draft, Indrajit Saha: Conceptualization; Data curation; Supervision; Funding acquisition; Formal analysis; Investigation; Methodology; Web development; Project administration; Resources; Validation; Visualization; Writing – review & editing, Nikhil Sharma: Conceptualization; Formal analysis; Software; Validation; Visualization; Writing – review & editing, Suman Nandi: Software; Validation; Visualization; Writing – review & editing, Dariusz Plewczynski: Conceptualization; Data curation; Supervision; Funding acquisition; Methodology; Project administration; Resources; Validation; Writing – review & editing.

Conflict of interest

The authors declare that they have no conflict of interest.
  23 in total

1.  Clustal Omega, accurate alignment of very large numbers of sequences.

Authors:  Fabian Sievers; Desmond G Higgins
Journal:  Methods Mol Biol       Date:  2014

2.  A method and server for predicting damaging missense mutations.

Authors:  Ivan A Adzhubei; Steffen Schmidt; Leonid Peshkin; Vasily E Ramensky; Anna Gerasimova; Peer Bork; Alexey S Kondrashov; Shamil R Sunyaev
Journal:  Nat Methods       Date:  2010-04       Impact factor: 28.547

3.  Genome-wide analysis of Indian SARS-CoV-2 genomes for the identification of genetic mutation and SNP.

Authors:  Indrajit Saha; Nimisha Ghosh; Debasree Maity; Nikhil Sharma; Jnanendra Prasad Sarkar; Kaushik Mitra
Journal:  Infect Genet Evol       Date:  2020-07-11       Impact factor: 3.342

4.  HemI: a toolkit for illustrating heatmaps.

Authors:  Wankun Deng; Yongbo Wang; Zexian Liu; Han Cheng; Yu Xue
Journal:  PLoS One       Date:  2014-11-05       Impact factor: 3.240

Review 5.  Comparing SARS-CoV-2 with SARS-CoV and influenza pandemics.

Authors:  Eskild Petersen; Marion Koopmans; Unyeong Go; Davidson H Hamer; Nicola Petrosillo; Francesco Castelli; Merete Storgaard; Sulien Al Khalili; Lone Simonsen
Journal:  Lancet Infect Dis       Date:  2020-07-03       Impact factor: 25.071

6.  A pneumonia outbreak associated with a new coronavirus of probable bat origin.

Authors:  Peng Zhou; Xing-Lou Yang; Xian-Guang Wang; Ben Hu; Lei Zhang; Wei Zhang; Hao-Rui Si; Yan Zhu; Bei Li; Chao-Lin Huang; Hui-Dong Chen; Jing Chen; Yun Luo; Hua Guo; Ren-Di Jiang; Mei-Qin Liu; Ying Chen; Xu-Rui Shen; Xi Wang; Xiao-Shuang Zheng; Kai Zhao; Quan-Jiao Chen; Fei Deng; Lin-Lin Liu; Bing Yan; Fa-Xian Zhan; Yan-Yi Wang; Geng-Fu Xiao; Zheng-Li Shi
Journal:  Nature       Date:  2020-02-03       Impact factor: 69.504

7.  Emergence of genomic diversity and recurrent mutations in SARS-CoV-2.

Authors:  Lucy van Dorp; Mislav Acman; Damien Richard; Liam P Shaw; Charlotte E Ford; Louise Ormond; Christopher J Owen; Juanita Pang; Cedric C S Tan; Florencia A T Boshier; Arturo Torres Ortiz; François Balloux
Journal:  Infect Genet Evol       Date:  2020-05-05       Impact factor: 3.342

8.  Genetic cluster analysis of SARS-CoV-2 and the identification of those responsible for the major outbreaks in various countries.

Authors:  Xuemei Yang; Ning Dong; Edward Wai-Chi Chan; Sheng Chen
Journal:  Emerg Microbes Infect       Date:  2020-12       Impact factor: 7.163

9.  Protein clustering and RNA phylogenetic reconstruction of the influenza A [corrected] virus NS1 protein allow an update in classification and identification of motif conservation.

Authors:  Edgar E Sevilla-Reyes; David A Chavaro-Pérez; Elvira Piten-Isidro; Luis H Gutiérrez-González; Teresa Santos-Mendoza
Journal:  PLoS One       Date:  2013-05-07       Impact factor: 3.240

10.  Pathogen proteins eliciting antibodies do not share epitopes with host proteins: a bioinformatics approach.

Authors:  Isaac Amela; Juan Cedano; Enrique Querol
Journal:  PLoS One       Date:  2007-06-06       Impact factor: 3.240

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.