Nimisha Ghosh, Indrajit Saha, Nikhil Sharma, Suman Nandi, Dariusz Plewczynski.
Abstract
Since the emergence of SARS-CoV-2, the research community has been searching for a vaccine against the virus. During this period, however, the virus has mutated as it adapted to different environmental conditions around the world, which has made vaccine design more challenging. Identifying the virus strains in circulation is therefore both timely and important. We performed a genome-wide analysis of 10664 SARS-CoV-2 genomes from 73 countries to identify and prepare a Single Nucleotide Polymorphism (SNP) dataset of SARS-CoV-2. Using this SNP data, we applied hierarchical clustering with Average Linkage and Complete Linkage, each combined separately with the Jaccard and Hamming distance functions, to identify the virus strains as clusters present in the SNP data. The consensus of both clustering results is also considered, and the Silhouette index is used as a cluster validity index both to measure the goodness of the clusters and to determine the number of clusters, or virus strains. As a result, we identified five major clusters, or virus strains, present worldwide. Apart from quantitative measures, these clusters are also visualised using the Visual Assessment of Tendency (VAT) plot. The evolution of these clusters is also shown. Furthermore, the top 10 signature SNPs are identified in each cluster, and the non-synonymous signature SNPs are visualised in the respective protein structures. Sequence and structural homology-based predictions, along with the protein structural stability of these non-synonymous signature SNPs, are reported in order to characterise the identified clusters. Consequently, T85I, Q57H and R203M in NSP2, ORF3a and Nucleocapsid, respectively, are found to be responsible for Cluster 1, as they are damaging and unstable non-synonymous signature SNPs. Similarly, F506L and S507C in Exon are responsible for both Clusters 3 and 4, while Clusters 2 and 5 do not exhibit such behaviour owing to the absence of any non-synonymous signature SNPs. The code, SNP dataset, 10664 labelled SARS-CoV-2 strains and additional results are provided as supplementary material through our website for further use.
Keywords: COVID-19; Clustering; Multiple sequence alignment; Non-synonymous SNP; SARS-CoV-2
Year: 2021 PMID: 33781798 PMCID: PMC7997709 DOI: 10.1016/j.virusres.2021.198401
Source DB: PubMed Journal: Virus Res ISSN: 0168-1702 Impact factor: 3.303
Fig. 1 (a) Pipeline of the workflow, (b) detection technique to find mutations as substitutions, (c) bar plot representing the frequency of SNPs at different genomic positions, BioCircos plot representing SNPs with their corresponding coding regions, and an example of the SNP dataset as a binary matrix, where 1 and 0 represent the presence and absence of a SNP in a specific sequence.
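The substitution detection and binary encoding described in this caption can be sketched as follows. This is only a minimal illustration, assuming the sequences are already aligned to a reference of equal length; the function names `find_snps` and `snp_matrix` and the toy sequences are hypothetical and are not taken from the authors' code or data.

```python
def find_snps(reference, sequence):
    """Return 1-based positions where an aligned sequence differs from the reference.

    Only unambiguous substitutions (A/C/G/T replaced by a different A/C/G/T)
    are counted; gaps and ambiguous bases such as N are skipped.
    """
    bases = set("ACGT")
    return {
        pos
        for pos, (ref, alt) in enumerate(zip(reference, sequence), start=1)
        if ref in bases and alt in bases and ref != alt
    }


def snp_matrix(reference, sequences):
    """Build a binary matrix: one row per sequence, one column per SNP position."""
    per_sequence = [find_snps(reference, seq) for seq in sequences]
    positions = sorted(set().union(*per_sequence))
    matrix = [[1 if pos in snps else 0 for pos in positions] for snps in per_sequence]
    return positions, matrix


# Toy aligned sequences (hypothetical, not taken from the dataset).
reference = "ACGTACGT"
sequences = ["ACGTACGT", "ATGTACGA", "ACGTNCGA"]
positions, matrix = snp_matrix(reference, sequences)
print(positions)  # [2, 8]
print(matrix)     # [[0, 0], [1, 1], [0, 1]]
```

Each row of the resulting matrix is the binary SNP profile of one genome, which is the kind of input a distance-based clustering step can work on.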
Optimal number of clusters produced by hierarchical clustering methods on SNPs data.
| Method | Distance function | Cluster validity index | Number of optimal clusters | Silhouette value |
|---|---|---|---|---|
| Average Linkage | Jaccard | Silhouette Index | 8 | 0.5163 |
| Average Linkage | Hamming | Silhouette Index | 7 | 0.5130 |
| Complete Linkage | Jaccard | Silhouette Index | 5 | 0.5317 |
| Complete Linkage | Hamming | Silhouette Index | 3 | 0.5103 |
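The four method and distance combinations in this table can be illustrated with a small agglomerative clustering sketch. It is a naive pure-Python version intended for toy inputs only, and it is not the authors' implementation; the function names, the `linkage` parameter and the toy matrix are assumptions made for illustration. Hamming distance is taken here as the fraction of differing positions, and Jaccard distance as one minus the ratio of shared SNPs to SNPs present in either vector.

```python
def jaccard(a, b):
    """Jaccard distance between binary vectors: 1 - |a AND b| / |a OR b|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def hamming(a, b):
    """Hamming distance: fraction of positions at which the vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def agglomerative(rows, k, distance, linkage="average"):
    """Merge the two closest clusters until k remain; return one label per row."""
    n = len(rows)
    dist = [[distance(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def between(c1, c2):
        # Complete linkage uses the largest pairwise distance, average the mean.
        pair = [dist[i][j] for i in c1 for j in c2]
        return max(pair) if linkage == "complete" else sum(pair) / len(pair)

    while len(clusters) > k:
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda xy: between(clusters[xy[0]], clusters[xy[1]]),
        )
        clusters[a] += clusters[b]
        del clusters[b]

    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy binary SNP matrix (hypothetical): two obvious groups of genomes.
rows = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
]
print(agglomerative(rows, 2, jaccard, "average"))   # [0, 0, 1, 1]
print(agglomerative(rows, 2, hamming, "complete"))  # [0, 0, 1, 1]
```

This version recomputes cluster distances at every merge, so it scales poorly; a dataset of 10664 genomes would in practice call for an optimised library routine.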
Fig. 2 (a) Plots of silhouette values for Average Linkage and Complete Linkage with Jaccard and Hamming distances for the number of clusters ranging from 2 to 100, (b) VAT plots of the clusters produced by Average Linkage and Complete Linkage with Jaccard and Hamming distances for higher silhouette values, (c) confusion matrices comparing the clustering results of Average Linkage with Jaccard and Hamming distances, as reported in Table 2, and of Average Linkage and Complete Linkage with Jaccard distance, as reported in Table 3.
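A VAT plot displays a dissimilarity matrix after its rows and columns have been reordered so that well-separated clusters show up as dark blocks along the diagonal. The ordering step, as it is commonly described for VAT, can be sketched as below; the function names and the toy matrix are hypothetical, and the rendering of the reordered matrix as an image is left out.

```python
def vat_order(dist):
    """Return the VAT ordering of objects for a symmetric dissimilarity matrix."""
    n = len(dist)
    # Start from one endpoint of the largest dissimilarity in the matrix.
    start = max(range(n), key=lambda i: max(dist[i]))
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        # Append the unvisited object closest to any visited object (Prim-like step).
        nxt = min(remaining, key=lambda j: min(dist[i][j] for i in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def reorder(dist, order):
    """Permute rows and columns of the matrix according to the ordering."""
    return [[dist[i][j] for j in order] for i in order]


# Toy dissimilarities (hypothetical): objects 0 and 2 are close, as are 1 and 3.
dist = [
    [0.0, 0.9, 0.1, 0.8],
    [0.9, 0.0, 0.8, 0.2],
    [0.1, 0.8, 0.0, 0.7],
    [0.8, 0.2, 0.7, 0.0],
]
order = vat_order(dist)
print(order)  # [0, 2, 3, 1]
for row in reorder(dist, order):
    print(row)
```

In the reordered matrix the small values gather in two 2 x 2 blocks on the diagonal, which is the visual cue for two clusters.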
Mapping of SARS-CoV-2 genomes to the different clusters produced by hierarchical clustering methods.
| Method | Distance function | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7 | Cluster 8 |
|---|---|---|---|---|---|---|---|---|---|
| Average Linkage | Jaccard | 9058 | 466 | 22 | 497 | 281 | 4 | 335 | 1 |
| Average Linkage | Hamming | 9044 | 497 | 26 | 501 | 263 | 332 | 1 | – |
| Complete Linkage | Jaccard | 8898 | 444 | 407 | 545 | 360 | – | – | – |
| Complete Linkage | Hamming | 9488 | 766 | 410 | – | – | – | – | – |
Re-evaluation of five major clusters produced by Average Linkage and Complete Linkage clustering with Jaccard distance using silhouette value.
| Method | Distance function | Cluster validity index | Number of clusters | Silhouette value |
|---|---|---|---|---|
| Average Linkage | Jaccard | Silhouette Index | 5 | 0.5581 |
| Complete Linkage | Jaccard | Silhouette Index | 5 | 0.5344 |
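The silhouette value reported here averages, over all genomes, the quantity s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from genome i to the other members of its cluster and b(i) is the smallest mean distance to the members of any other cluster. A minimal sketch of that computation follows; the function name, the convention of scoring singleton clusters as 0, and the toy distance matrix are assumptions for illustration.

```python
def silhouette(dist, labels):
    """Mean silhouette value given a distance matrix and one label per object."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    if len(groups) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = groups[lab]
        if len(own) == 1:
            continue  # assumed convention: a singleton cluster contributes 0
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist[i][j] for j in other) / len(other)
            for key, other in groups.items()
            if key != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


# Toy distances (hypothetical): objects 0 and 2 are close, as are 1 and 3.
dist = [
    [0.0, 0.9, 0.1, 0.8],
    [0.9, 0.0, 0.8, 0.2],
    [0.1, 0.8, 0.0, 0.7],
    [0.8, 0.2, 0.7, 0.0],
]
good = silhouette(dist, [0, 1, 0, 1])  # labels that match the structure
poor = silhouette(dist, [0, 0, 1, 1])  # labels that cut across it
print(round(good, 4), round(poor, 4))
```

Repeating this for each candidate number of clusters and keeping the partition with the highest value is one way to arrive at a choice such as five clusters.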
Mapping of SARS-CoV-2 genomes to the five clusters and the corresponding percentage.
| Country | Total number of genomes | Cluster 1: number of genomes | Cluster 1 (%) | Cluster 2: number of genomes | Cluster 2 (%) | Cluster 3: number of genomes | Cluster 3 (%) | Cluster 4: number of genomes | Cluster 4 (%) | Cluster 5: number of genomes | Cluster 5 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| USA | 2546 | 2236 | 87.82 | 7 | 0.27 | 64 | 2.51 | 76 | 2.99 | 163 | 6.40 |
| England | 1592 | 1188 | 74.62 | 187 | 11.75 | 115 | 7.22 | 9 | 0.57 | 93 | 5.84 |
| China | 631 | 585 | 92.71 | 1 | 0.16 | 11 | 1.74 | 30 | 4.75 | 4 | 0.63 |
| Australia | 582 | 345 | 59.28 | 186 | 31.96 | 29 | 4.98 | 14 | 2.41 | 8 | 1.37 |
| Netherlands | 568 | 565 | 99.47 | 0 | 0 | 2 | 0.35 | 0 | 0 | 1 | 0.18 |
| India | 566 | 495 | 87.46 | 0 | 0 | 20 | 3.53 | 48 | 8.48 | 3 | 0.53 |
| Iceland | 462 | 381 | 82.47 | 1 | 0.22 | 74 | 16.02 | 6 | 1.30 | 0 | 0 |
| Scotland | 434 | 406 | 93.55 | 0 | 0 | 28 | 6.45 | 0 | 0 | 0 | 0 |
| Belgium | 426 | 421 | 98.83 | 0 | 0 | 5 | 1.17 | 0 | 0 | 0 | 0 |
| Portugal | 349 | 342 | 97.99 | 0 | 0 | 7 | 2.01 | 0 | 0 | 0 | 0 |
| Spain | 267 | 206 | 77.15 | 24 | 8.99 | 15 | 5.62 | 0 | 0 | 22 | 8.24 |
| Wales | 214 | 158 | 73.83 | 33 | 15.42 | 15 | 7.01 | 2 | 0.93 | 6 | 2.80 |
| Sweden | 194 | 194 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| France | 189 | 183 | 96.83 | 2 | 1.06 | 2 | 1.06 | 1 | 0.53 | 1 | 0.53 |
| New Zealand | 175 | 160 | 91.43 | 0 | 0 | 15 | 8.57 | 0 | 0 | 0 | 0 |
| Switzerland | 164 | 109 | 66.46 | 0 | 0 | 16 | 9.76 | 38 | 23.17 | 1 | 0.61 |
| Denmark | 109 | 101 | 92.66 | 0 | 0 | 7 | 6.42 | 0 | 0 | 1 | 0.92 |
| Japan | 94 | 89 | 94.68 | 0 | 0 | 2 | 2.13 | 0 | 0 | 3 | 3.19 |
| Brazil | 81 | 80 | 98.77 | 0 | 0 | 1 | 1.23 | 0 | 0 | 0 | 0 |
| Canada | 72 | 67 | 93.06 | 1 | 1.39 | 4 | 5.56 | 0 | 0 | 0 | 0 |
| Luxembourg | 71 | 60 | 84.51 | 0 | 0 | 7 | 9.86 | 1 | 1.41 | 3 | 4.23 |
| Germany | 68 | 67 | 98.53 | 0 | 0 | 1 | 1.47 | 0 | 0 | 0 | 0 |
| Italy | 66 | 55 | 83.33 | 1 | 1.52 | 7 | 10.61 | 3 | 4.55 | 0 | 0 |
| Kazakhstan | 49 | 26 | 53.06 | 1 | 2.04 | 0 | 0 | 22 | 44.90 | 0 | 0 |
| Oman | 42 | 41 | 97.62 | 0 | 0 | 0 | 0 | 1 | 2.38 | 0 | 0 |
| Poland | 39 | 33 | 84.62 | 0 | 0 | 4 | 10.26 | 0 | 0 | 2 | 5.13 |
| South Korea | 36 | 36 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Vietnam | 31 | 28 | 90.32 | 2 | 6.45 | 0 | 0 | 1 | 3.23 | 0 | 0 |
| Singapore | 28 | 18 | 64.29 | 0 | 0 | 2 | 7.14 | 1 | 3.57 | 7 | 25 |
| Thailand | 28 | 23 | 82.14 | 0 | 0 | 3 | 10.71 | 2 | 7.14 | 0 | 0 |
| Russia | 27 | 26 | 96.30 | 0 | 0 | 1 | 3.70 | 0 | 0 | 0 | 0 |
| Finland | 26 | 18 | 69.23 | 0 | 0 | 6 | 23.08 | 0 | 0 | 2 | 7.69 |
| Czech Republic | 25 | 25 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Mexico | 21 | 19 | 90.48 | 0 | 0 | 1 | 4.76 | 1 | 4.76 | 0 | 0 |
| Norway | 20 | 14 | 70 | 0 | 0 | 5 | 25 | 1 | 5 | 0 | 0 |
| Northern Ireland | 19 | 8 | 42.11 | 8 | 42.11 | 2 | 10.53 | 1 | 5.26 | 0 | 0 |
| Estonia | 18 | 18 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Austria | 16 | 14 | 87.50 | 0 | 0 | 2 | 12.50 | 0 | 0 | 0 | 0 |
| Chile | 15 | 14 | 93.33 | 0 | 0 | 1 | 6.67 | 0 | 0 | 0 | 0 |
| DRC | 15 | 12 | 80 | 0 | 0 | 1 | 6.67 | 0 | 0 | 2 | 13.33 |
| Colombia | 14 | 11 | 78.57 | 0 | 0 | 2 | 14.29 | 1 | 7.14 | 0 | 0 |
| Senegal | 19 | 12 | 63.16 | 1 | 5.26 | 0 | 0 | 0 | 0 | 6 | 31.58 |
| Croatia | 12 | 10 | 83.33 | 0 | 0 | 0 | 0 | 2 | 16.67 | 0 | 0 |
| Georgia | 11 | 5 | 45.45 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 54.55 |
| Kenya | 11 | 11 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Malaysia | 11 | 8 | 72.73 | 0 | 0 | 3 | 27.27 | 0 | 0 | 0 | 0 |
| Romania | 11 | 11 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| South Africa | 11 | 7 | 63.64 | 0 | 0 | 4 | 36.36 | 0 | 0 | 0 | 0 |
| Ireland | 10 | 10 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Latvia | 10 | 10 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Nigeria | 8 | 8 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Kuwait | 7 | 5 | 71.43 | 0 | 0 | 1 | 14.29 | 1 | 14.29 | 0 | 0 |
| Turkey | 5 | 4 | 80 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 20 |
| Bangladesh | 4 | 4 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Greece | 4 | 4 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Qatar | 4 | 3 | 75 | 0 | 0 | 0 | 0 | 1 | 25 | 0 | 0 |
| Slovakia | 4 | 4 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Algeria | 3 | 3 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Argentina | 3 | 3 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Belarus | 3 | 3 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Hungary | 3 | 3 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Saudi Arabia | 4 | 0 | 0 | 0 | 0 | 1 | 25 | 0 | 0 | 3 | 75 |
| Indonesia | 2 | 1 | 50 | 0 | 0 | 1 | 50 | 0 | 0 | 0 | 0 |
| Israel | 2 | 2 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pakistan | 2 | 2 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Serbia | 2 | 2 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Slovenia | 2 | 2 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Cambodia | 1 | 1 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Lithuania | 1 | 1 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Morocco | 1 | 1 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Nepal | 1 | 1 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Panama | 1 | 1 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Peru | 1 | 1 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
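The percentage columns follow directly from the counts: each value is the cluster count divided by the country total, expressed as a percentage and rounded to two decimals. A small sketch using the USA row is given below; the function name is hypothetical, and the country total is assumed to be positive.

```python
def cluster_percentages(counts):
    """Convert per-cluster genome counts into percentages of the row total."""
    total = sum(counts)
    return [round(100 * c / total, 2) for c in counts]


# USA row from the table: counts for Clusters 1 to 5.
usa = [2236, 7, 64, 76, 163]
print(sum(usa))                  # 2546
print(cluster_percentages(usa))  # [87.82, 0.27, 2.51, 2.99, 6.4]
```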
Fig. 3 Circos plot to visualise the mapping of SARS-CoV-2 genomes to five clusters.
Top 10 signature SNPs in each cluster.
| Cluster | Number of sequences in each cluster | Coordinate of signature SNPs | Occurrence of signature SNPs in genome | Change in nucleotide | Change in amino acid | Coordinate of amino acid in protein | Mapped with coding and Non-coding region |
|---|---|---|---|---|---|---|---|
| Cluster 1 | 8887 | 241 | 6088 | C>T | NA | NA | 5′-UTR |
| Cluster 1 | 8887 | 1059 | 1735 | C>T | T>I | 85 | ORF1ab |
| Cluster 1 | 8887 | 3037 | 6071 | C>T | Synonymous | 106 | ORF1ab |
| Cluster 1 | 8887 | 14,408 | 6046 | (C>T)(C>A) | (P>L), (P>H) | 323 | ORF1ab |
| Cluster 1 | 8887 | 23,403 | 6073 | A>G | D>G | 614 | Spike |
| Cluster 1 | 8887 | 25,563 | 2232 | (G>T)(G>C) | Q>H | 57 | ORF3a |
| Cluster 1 | 8887 | 28,881 | 1855 | (G>A)(G>T) | (R>K) (R>M) | 203 | Nucleocapsid |
| Cluster 1 | 8887 | 28,882 | 1848 | (G>A)(G>T) | Synonymous, (R>S) | 203 | Nucleocapsid |
| Cluster 1 | 8887 | 28,883 | 1847 | G>C | G>R | 204 | Nucleocapsid |
| Cluster 1 | 8887 | 29,816 | 2613 | (T>A)(T>G) | NA | NA | 3′-UTR |
| Cluster 2 | 444 | 29,816 | 416 | (T>A)(T>G) | NA | NA | 3′-UTR |
| Cluster 2 | 444 | 29,857 | 348 | (C>A)(C>T)(C>G) | NA | NA | 3′-UTR |
| Cluster 2 | 444 | 29,858 | 369 | (T>A)(T>C)(T>G) | NA | NA | 3′-UTR |
| Cluster 2 | 444 | 29,859 | 402 | (T>A)(T>G)(T>C) | NA | NA | 3′-UTR |
| Cluster 2 | 444 | 29,861 | 416 | (G>A)(G>C)(G>T) | NA | NA | 3′-UTR |
| Cluster 2 | 444 | 29,862 | 427 | (G>C)(G>A)(G>T) | NA | NA | 3′-UTR |
| Cluster 2 | 444 | 29,864 | 435 | (G>A)(G>C)(G>T) | NA | NA | 3′-UTR |
| Cluster 2 | 444 | 29,867 | 437 | (T>A)(T>G)(T>C) | NA | NA | 3′-UTR |
| Cluster 2 | 444 | 29,868 | 435 | (G>A)(G>T)(G>C) | NA | NA | 3′-UTR |
| Cluster 2 | 444 | 29,870 | 433 | (C>A)(C>G)(C>T) | NA | NA | 3′-UTR |
| Cluster 3 | 492 | 19,557 | 475 | (T>A)(T>C)(T>G) | (F>L), Synonymous, (F>L) | 506 | ORF1ab |
| Cluster 3 | 492 | 19,558 | 479 | (A>G)(A>C)(A>T) | (S>G) (S>R) (S>C) | 507 | ORF1ab |
| Cluster 3 | 492 | 22,506 | 469 | (C>A)(C>T)(C>G) | (T>N) (T>I) (T>S) | 315 | Spike |
| Cluster 3 | 492 | 29,776 | 486 | (A>G)(A>T) | NA | NA | 3′-UTR |
| Cluster 3 | 492 | 29,779 | 489 | (G>A)(G>T) | NA | NA | 3′-UTR |
| Cluster 3 | 492 | 29,780 | 487 | (A>G)(A>C) | NA | NA | 3′-UTR |
| Cluster 3 | 492 | 29,781 | 492 | (G>A)(G>T)(G>C) | NA | NA | 3′-UTR |
| Cluster 3 | 492 | 29,782 | 487 | (A>G)(A>C) | NA | NA | 3′-UTR |
| Cluster 3 | 492 | 29,783 | 490 | (G>C)(G>T)(G>A) | NA | NA | 3′-UTR |
| Cluster 3 | 492 | 29,784 | 483 | (C>T)(C>A)(C>G) | NA | NA | 3′-UTR |
| Cluster 4 | 263 | 19,557 | 263 | (T>A)(T>C)(T>G) | (F>L), Synonymous, (F>L) | 506 | ORF1ab |
| Cluster 4 | 263 | 19,558 | 263 | (A>G)(A>C)(A>T) | (S>G) (S>R) (S>C) | 507 | ORF1ab |
| Cluster 4 | 263 | 29,858 | 259 | (T>A)(T>C)(T>G) | NA | NA | 3′-UTR |
| Cluster 4 | 263 | 29,859 | 262 | (T>A)(T>G)(T>C) | NA | NA | 3′-UTR |
| Cluster 4 | 263 | 29,860 | 248 | (A>G)(A>C)(A>T) | NA | NA | 3′-UTR |
| Cluster 4 | 263 | 29,861 | 263 | (G>A)(G>C)(G>T) | NA | NA | 3′-UTR |
| Cluster 4 | 263 | 29,862 | 263 | (G>C)(G>A)(G>T) | NA | NA | 3′-UTR |
| Cluster 4 | 263 | 29,863 | 245 | (A>C)(A>T)(A>G) | NA | NA | 3′-UTR |
| Cluster 4 | 263 | 29,864 | 261 | (G>A)(G>C)(G>T) | NA | NA | 3′-UTR |
| Cluster 4 | 263 | 29,867 | 249 | (T>A)(T>G)(T>C) | NA | NA | 3′-UTR |
| Cluster 5 | 323 | 3 | 302 | (T>G)(T>A)(T>C) | NA | NA | 5′-UTR |
| Cluster 5 | 323 | 4 | 306 | (A>G)(A>C)(A>T) | NA | NA | 5′-UTR |
| Cluster 5 | 323 | 5 | 313 | (A>T)(A>G)(A>C) | NA | NA | 5′-UTR |
| Cluster 5 | 323 | 6 | 314 | (A>T)(A>C)(A>G) | NA | NA | 5′-UTR |
| Cluster 5 | 323 | 7 | 320 | (G>T)(G>A)(G>C) | NA | NA | 5′-UTR |
| Cluster 5 | 323 | 8 | 322 | (G>A)(G>C)(G>T) | NA | NA | 5′-UTR |
| Cluster 5 | 323 | 10 | 321 | (T>A)(T>C)(T>G) | NA | NA | 5′-UTR |
| Cluster 5 | 323 | 11 | 319 | (T>C)(T>G)(T>A) | NA | NA | 5′-UTR |
| Cluster 5 | 323 | 12 | 320 | (A>C)(A>T)(A>G) | NA | NA | 5′-UTR |
| Cluster 5 | 323 | 29,816 | 320 | (T>A)(T>G) | NA | NA | 3′-UTR |
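One plausible way to obtain a table like this is to count, within each cluster, how many genomes carry each SNP and keep the most frequent positions. The sketch below works under that assumption, which is suggested by the "Occurrence" column but is not confirmed as the authors' exact criterion; the function name and the toy data are hypothetical.

```python
from collections import Counter


def signature_snps(positions, matrix, labels, top=10):
    """For each cluster, return the `top` SNP positions carried by the most members."""
    counts = {}
    for row, label in zip(matrix, labels):
        counter = counts.setdefault(label, Counter())
        counter.update(pos for pos, bit in zip(positions, row) if bit)
    return {label: counter.most_common(top) for label, counter in counts.items()}


# Toy binary SNP matrix (hypothetical rows; column coordinates reused from the table).
positions = [241, 3037, 23403, 29816]
matrix = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
]
labels = [0, 0, 0, 1, 1]
print(signature_snps(positions, matrix, labels, top=2))
# {0: [(23403, 3), (241, 2)], 1: [(29816, 2), (241, 1)]}
```

The same position can rank highly in more than one cluster, as coordinate 29,816 does in Clusters 1, 2 and 5 of the table.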
Fig. 4 Venn diagram to represent the common signature SNPs in the five clusters.
Sequence and structural homology-based prediction of non-synonymous signature SNPs along with their protein structural stability.
| Cluster | Change in amino acid | Coded protein | PROVEAN prediction | PROVEAN score | PolyPhen-2 prediction | PolyPhen-2 score | I-Mutant 2.0 stability | I-Mutant 2.0 DDG | I-Mutant 2.0 RI |
|---|---|---|---|---|---|---|---|---|---|
| Cluster 1 | − | − | | | | | | | |
| Cluster 1 | P323L | RdRp | Neutral | −0.865 | Benign | 0.005 | Decrease | −0.8 | 6 |
| Cluster 1 | P323H | RdRp | Neutral | −0.865 | Benign | 0.005 | Decrease | −2.09 | 6 |
| Cluster 1 | D614G | Spike | Neutral | 0.598 | Benign | 0.004 | Decrease | −1.94 | 7 |
| Cluster 1 | − | − | | | | | | | |
| Cluster 1 | R203K | Nucleocapsid | Neutral | −1.604 | Probably Damaging | 0.969 | Decrease | −2.26 | 5 |
| Cluster 1 | − | Decrease | − | | | | | | |
| Cluster 1 | R203S | Nucleocapsid | Neutral | −2.374 | Probably Damaging | 0.994 | Decrease | −2.1 | 6 |
| Cluster 1 | G204R | Nucleocapsid | Neutral | −1.656 | Probably Damaging | 1 | Decrease | 0 | 7 |
| Cluster 3 | − | − | | | | | | | |
| Cluster 3 | S507G | Exon | Neutral | −2.337 | Possibly Damaging | 0.662 | Decrease | −2.48 | 8 |
| Cluster 3 | S507R | Exon | Neutral | −1.411 | Benign | 0.015 | Increase | −0.39 | 2 |
| Cluster 3 | − | − | | | | | | | |
| Cluster 3 | T315N | Spike | Neutral | −2.206 | Probably Damaging | 0.998 | Decrease | −0.21 | 2 |
| Cluster 3 | T315I | Spike | Neutral | 0.365 | Probably Damaging | 0.999 | Decrease | −0.21 | 4 |
| Cluster 3 | T315S | Spike | Neutral | −1.217 | Probably Damaging | 0.995 | Decrease | −0.51 | 6 |
| Cluster 4 | − | − | | | | | | | |
| Cluster 4 | S507G | Exon | Neutral | −2.337 | Possibly Damaging | 0.662 | Decrease | −2.48 | 8 |
| Cluster 4 | S507R | Exon | Neutral | −1.411 | Benign | 0.015 | Increase | −0.39 | 2 |
| Cluster 4 | − | − | | | | | | | |
Fig. 5 Non-synonymous signature SNPs highlighted in the structures of (a) NSP2, (b) ORF3a, (c) Nucleocapsid and (d) Exon.
Temporal evolution of SARS-CoV-2 genomes in the five major clusters from January to July for 73 countries.
| Cluster | January | March | April | May | June | July |
|---|---|---|---|---|---|---|
| Cluster 1 | 548 | 6264 | 1510 | 541 | 3 | 21 |
| Cluster 2 | 13 | 266 | 71 | 94 | 0 | 0 |
| Cluster 3 | 18 | 276 | 136 | 57 | 0 | 5 |
| Cluster 4 | 22 | 146 | 71 | 21 | 0 | 3 |
| Cluster 5 | 12 | 185 | 45 | 81 | 0 | 0 |
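As a quick arithmetic check, the monthly counts in each row can be summed and compared with the cluster sizes listed in the signature SNP table (8887, 444, 492, 263 and 323). The snippet below simply restates the table values; the variable names are illustrative.

```python
# Monthly counts per cluster, in the column order of the table above.
monthly = {
    "Cluster 1": [548, 6264, 1510, 541, 3, 21],
    "Cluster 2": [13, 266, 71, 94, 0, 0],
    "Cluster 3": [18, 276, 136, 57, 0, 5],
    "Cluster 4": [22, 146, 71, 21, 0, 3],
    "Cluster 5": [12, 185, 45, 81, 0, 0],
}
# Cluster sizes from the "Number of sequences in each cluster" column.
sizes = {"Cluster 1": 8887, "Cluster 2": 444, "Cluster 3": 492,
         "Cluster 4": 263, "Cluster 5": 323}

totals = {name: sum(counts) for name, counts in monthly.items()}
mismatches = {name for name in sizes if totals[name] != sizes[name]}
print(totals)
print(mismatches)            # set()
print(sum(totals.values()))  # 10409
```

Every row total matches the corresponding cluster size, and the five clusters together account for 10409 genomes.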
Pie charts to represent mapping of SARS-CoV-2 genomes to the five major clusters and the evolution of such genomes from January to July for 73 countries.
Fig. 6 Colours to represent (a) the five major clusters and (b) the months from January to July 2020.
Fig. 7 Heatmap to represent the common clusters as virus strains among 73 countries.