Yawei Li¹, Qingyun Liu², Zexian Zeng³, Yuan Luo¹.
Abstract
Deciphering the population structure of SARS-CoV-2 is critical to inform public health management and reduce the risk of future dissemination. As SARS-CoV-2 genomes continue to accrue worldwide, an effective way to group them is needed to organize the landscape of the virus's population structure. Taking advantage of recently published state-of-the-art machine learning algorithms, we used an unsupervised deep learning clustering algorithm to group a total of 16,873 SARS-CoV-2 genomes. Using single nucleotide polymorphisms as input features, we identified six major subtypes of SARS-CoV-2. The proportions of the clusters across the continents revealed distinct geographical distributions. Comprehensive analysis indicated that both genetic factors and human migration factors shaped the specific geographical distribution of the population structure. This study provides a different approach, using clustering methods, to study the population structure of a novel and fast-growing species such as SARS-CoV-2. Moreover, clustering techniques can be used for further studies of local population structures of the proliferating virus.
Keywords: SARS-CoV-2; deep learning clustering; evolution; nucleotide substitutions; population structure
Year: 2022 PMID: 35456454 PMCID: PMC9030792 DOI: 10.3390/genes13040648
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.141
Figure 1. Clustering of SARS-CoV-2. (A) Phylogenetic tree of the 16,873 SARS-CoV-2 strains. The inner colored panel represents the continent for each collected strain, and the outer colored panel represents the partitions of the six clusters in the tree. (B) Mean intra-cluster and inter-cluster pairwise genetic distances across the six clusters. The blue bars represent mean pairwise genetic distances between pairs of isolates within the corresponding clusters, and the red bars represent mean pairwise genetic distances between pairs of isolates outside the corresponding clusters. The error bars display the standard deviations. For all six clusters, the mean intra-cluster pairwise genetic distances are significantly lower than the corresponding mean inter-cluster pairwise genetic distances (p-value < 0.001, Wilcoxon rank-sum test). (C) The t-SNE plot displays the clustering of the strains.
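
The intra-cluster versus inter-cluster comparison in panel (B) can be illustrated with a minimal sketch. It assumes aligned sequences of equal length and uses the Hamming distance (count of differing sites) as the genetic distance; the record does not state which distance measure the authors used. It also reads "inter-cluster" as distances between members of a cluster and all isolates outside it, which is one possible interpretation of the caption. The function names and toy sequences are hypothetical.

```python
from itertools import combinations
from statistics import mean


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def intra_inter_distances(seqs, labels, cluster):
    """Return (mean intra-cluster, mean inter-cluster) pairwise distance.

    Intra: all pairs of isolates inside `cluster`.
    Inter: all pairs of one isolate inside and one outside `cluster`
    (an assumed reading of the figure caption).
    """
    inside = [s for s, lab in zip(seqs, labels) if lab == cluster]
    outside = [s for s, lab in zip(seqs, labels) if lab != cluster]
    intra = [hamming(a, b) for a, b in combinations(inside, 2)]
    inter = [hamming(a, b) for a in inside for b in outside]
    return mean(intra), mean(inter)


# Toy aligned sequences (hypothetical, not taken from the study).
toy_seqs = ["ACGTAC", "ACGTAT", "ACGTAC", "TCGAAC", "TCGAAT"]
toy_labels = ["A", "A", "A", "B", "B"]

intra_a, inter_a = intra_inter_distances(toy_seqs, toy_labels, "A")
print(f"cluster A: intra={intra_a:.3f}, inter={inter_a:.3f}")
```

To reproduce the significance test named in the caption, the two lists of pairwise distances could be passed to a Wilcoxon rank-sum routine such as `scipy.stats.ranksums`; that step is omitted here to keep the sketch free of third-party dependencies.
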
Figure 2. Geographic distributions of the six clusters. Pie charts display the proportions of six clusters among all SARS-CoV-2 strains in each country. Circle sizes and the color scales correspond to the number of strains analyzed per country.
Geographic distribution of strains across six continents for each cluster.
| Continent | Cluster A | Cluster B | Cluster C | Cluster D | Cluster E | Cluster F | Total |
|---|---|---|---|---|---|---|---|
| Africa | 3 | 4 | 65 | 7 | 10 | 9 | 98 |
| Asia | 38 | 648 | 248 | 217 | 57 | 116 | 1324 |
| Europe | 1137 | 990 | 3119 | 212 | 1108 | 2961 | 9527 |
| North America | 94 | 334 | 625 | 1268 | 2274 | 170 | 4765 |
| Oceania | 110 | 161 | 233 | 196 | 191 | 149 | 1040 |
| South America | 6 | 5 | 44 | 10 | 5 | 49 | 119 |
| Total | 1388 | 2142 | 4334 | 1910 | 3645 | 3454 | 16,873 |
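
As a small worked example, the counts in the table above can be turned into proportions in two directions: the share of each cluster within a continent, and the share of each continent within a cluster. The sketch below copies the counts from the table; the function and variable names are illustrative only and do not come from the study.

```python
CLUSTERS = ["A", "B", "C", "D", "E", "F"]

# Strain counts per continent, in cluster order A-F (copied from the table).
COUNTS = {
    "Africa": [3, 4, 65, 7, 10, 9],
    "Asia": [38, 648, 248, 217, 57, 116],
    "Europe": [1137, 990, 3119, 212, 1108, 2961],
    "North America": [94, 334, 625, 1268, 2274, 170],
    "Oceania": [110, 161, 233, 196, 191, 149],
    "South America": [6, 5, 44, 10, 5, 49],
}


def cluster_totals(counts):
    """Column sums: total number of strains assigned to each cluster."""
    return [sum(row[i] for row in counts.values()) for i in range(len(CLUSTERS))]


def shares_within_continent(counts):
    """Row-normalised table: fraction of a continent's strains in each cluster."""
    return {name: [v / sum(row) for v in row] for name, row in counts.items()}


def shares_within_cluster(counts):
    """Column-normalised table: fraction of a cluster's strains from each continent."""
    totals = cluster_totals(counts)
    return {name: [v / t for v, t in zip(row, totals)] for name, row in counts.items()}


totals = cluster_totals(COUNTS)
by_continent = shares_within_continent(COUNTS)
by_cluster = shares_within_cluster(COUNTS)

# Most frequent cluster in each continent.
dominant = {
    name: CLUSTERS[row.index(max(row))] for name, row in COUNTS.items()
}

for name, shares in by_continent.items():
    formatted = ", ".join(f"{c}={s:.1%}" for c, s in zip(CLUSTERS, shares))
    print(f"{name}: {formatted} (dominant: {dominant[name]})")
```

Row normalisation answers "which clusters circulate in a given continent", while column normalisation answers "where does a given cluster come from"; the two views can differ sharply because Europe and North America contribute most of the sampled genomes.
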
Figure 3. The genetic diversity between clusters. (A) The mutation counts over days of 16,873 SARS-CoV-2 strains. The X axis represents the days from the corresponding collection date of strains to 24 December 2019 when the earliest strain (EPI_ISL_402123) was collected. The Y axis represents the number of mutations of each collected strain. A mutation is defined by a nucleotide change from the original nucleotide in the reference genome to the alternative nucleotide in the studied viral genome. (B–G) The intra-cluster nucleotide diversity (π) per site for each gene and genome-wide across six clusters.
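
The two quantities in this caption can be sketched in a few lines. The mutation count follows the definition given in the caption (sites where the studied genome differs from the reference). For nucleotide diversity, the sketch assumes the common definition of π as the average number of pairwise differences per site; the record does not state which estimator or software the authors used. Sequences are assumed to be aligned and of equal length, gaps and ambiguous bases are not handled, and all names and toy sequences are hypothetical.

```python
from itertools import combinations


def count_mutations(reference, genome):
    """Number of sites where `genome` differs from the aligned `reference`."""
    if len(reference) != len(genome):
        raise ValueError("genome must be aligned to the reference")
    return sum(r != g for r, g in zip(reference, genome))


def nucleotide_diversity(seqs):
    """Average pairwise differences per site (pi) for aligned sequences."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    # Sum of differing sites over all unordered pairs of sequences.
    total_diff = sum(
        sum(x != y for x, y in zip(a, b)) for a, b in combinations(seqs, 2)
    )
    n_pairs = n * (n - 1) // 2
    return total_diff / (n_pairs * length)


# Toy data (hypothetical, not from the study).
toy_reference = "ACGTACGT"
toy_cluster = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]

toy_counts = [count_mutations(toy_reference, s) for s in toy_cluster]
toy_pi = nucleotide_diversity(toy_cluster)
print("mutations per strain:", toy_counts)
print(f"pi per site: {toy_pi:.4f}")
```

Restricting the input columns to the coordinates of a single gene would give a per-gene π, as in panels (B–G).
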
Figure 4. The clustering of the six clusters by the extracted mutations. (A) The heatmap displays mutation frequency of the 42 mutations across six clusters. The colors and values represent different frequencies of the corresponding mutations in each cluster. The collection days of the mutations are shown in (B). The X axis represents the days from the corresponding collection date of strains to 24 December 2019 when the earliest strain (EPI_ISL_402123) was collected. Circle sizes represent the frequency of the mutations on each collection day.
Information on the 42 mutations extracted using ANOVA.
| Mutation | Substitution | Amino Acid Substitution | Type | Gene | Frequency | Cluster A | Cluster B | Cluster C | Cluster D | Cluster E | Cluster F |
|---|---|---|---|---|---|---|---|---|---|---|---|
| C241T | C>T | Intron | Intron | Intron | 66.37% | 10 | 10 | 4238 | 2 | 3548 | 3391 |
| T490A | T>A | D>E | N | ORF1ab | 1.04% | 0 | 0 | 1 | 174 | 0 | 0 |
| T514C | T>C | H>H | S | ORF1ab | 0.97% | 0 | 162 | 1 | 0 | 0 | 0 |
| C1059T * | C>T | T>I | N | ORF1ab | 21.69% | 1 | 8 | 2 | 0 | 3645 | 3 |
| G1397A | G>A | V>I | N | ORF1ab | 1.12% | 0 | 186 | 0 | 0 | 1 | 2 |
| G1440A | G>A | G>D | N | ORF1ab | 1.92% | 0 | 324 | 0 | 0 | 0 | 0 |
| A2480G | A>G | I>V | N | ORF1ab | 3.60% | 608 | 0 | 0 | 0 | 0 | 0 |
| C2558T | C>T | P>S | N | ORF1ab | 3.83% | 646 | 1 | 0 | 0 | 0 | 0 |
| G2891A * | G>A | A>T | N | ORF1ab | 1.77% | 0 | 298 | 0 | 0 | 0 | 0 |
| C3037T | C>T | F>F | S | ORF1ab | 67.26% | 2 | 7 | 4277 | 3 | 3611 | 3448 |
| C3177T | C>T | P>L | N | ORF1ab | 1.05% | 0 | 0 | 1 | 171 | 6 | 0 |
| C6312A | C>A | T>K | N | ORF1ab | 1.14% | 0 | 189 | 1 | 0 | 0 | 3 |
| C8782T | C>T | S>S | S | ORF1ab | 11.42% | 1 | 21 | 5 | 1898 | 1 | 1 |
| T9477A | T>A | F>Y | N | ORF1ab | 1.17% | 0 | 3 | 0 | 195 | 0 | 0 |
| G11083T * | G>T | L>F | N | ORF1ab | 11.81% | 1342 | 485 | 52 | 21 | 54 | 39 |
| C14408T * | C>T | P>L | N | ORF1ab | 67.47% | 1 | 8 | 4301 | 2 | 3636 | 3436 |
| C14805T | C>T | Y>Y | S | ORF1ab | 9.39% | 1352 | 8 | 1 | 195 | 0 | 28 |
| T17247C | T>C | R>R | S | ORF1ab | 3.00% | 500 | 5 | 1 | 0 | 0 | 0 |
| C17747T * | C>T | P>L | N | ORF1ab | 6.92% | 1 | 0 | 0 | 1165 | 1 | 0 |
| A17858G | A>G | Y>C | N | ORF1ab | 7.05% | 1 | 1 | 0 | 1187 | 0 | 0 |
| C18060T | C>T | L>L | S | ORF1ab | 7.16% | 0 | 3 | 2 | 1202 | 1 | 0 |
| T18736C | T>C | F>L | N | ORF1ab | 1.01% | 0 | 0 | 1 | 169 | 0 | 0 |
| C18877T | C>T | L>L | S | ORF1ab | 2.67% | 2 | 2 | 440 | 4 | 0 | 2 |
| A20268G | A>G | L>L | S | ORF1ab | 4.61% | 0 | 1 | 773 | 3 | 0 | 1 |
| A23403G * | A>G | D>G | N | S | 67.65% | 4 | 4 | 4316 | 6 | 3634 | 3451 |
| C23731T | C>T | T>T | S | S | 1.68% | 0 | 0 | 0 | 0 | 1 | 282 |
| C23929T | C>T | Y>Y | S | S | 1.13% | 0 | 186 | 1 | 0 | 1 | 2 |
| C24034T | C>T | N>N | S | S | 1.16% | 0 | 2 | 1 | 187 | 4 | 1 |
| G25563T * | G>T | Q>H | N | ORF3a | 26.44% | 1 | 3 | 829 | 2 | 3625 | 2 |
| G25979T | G>T | G>V | N | ORF3a | 1.16% | 0 | 2 | 1 | 193 | 0 | 0 |
| G26144T * | G>T | G>V | N | ORF3a | 8.61% | 1387 | 62 | 0 | 1 | 1 | 1 |
| T26729C | T>C | A>A | S | M | 1.07% | 0 | 1 | 1 | 179 | 0 | 0 |
| C27046T | C>T | T>M | N | M | 2.13% | 0 | 1 | 5 | 0 | 0 | 353 |
| G28077C | G>C | V>L | N | ORF8 | 1.13% | 0 | 1 | 1 | 188 | 0 | 0 |
| T28144C * | T>C | L>S | N | ORF8 | 11.36% | 0 | 10 | 1 | 1903 | 2 | 0 |
| C28657T | C>T | D>D | S | N | 1.21% | 0 | 3 | 3 | 196 | 1 | 2 |
| T28688C | T>C | L>L | S | N | 1.07% | 0 | 178 | 1 | 0 | 1 | 0 |
| C28863T | C>T | S>L | N | N | 1.19% | 1 | 2 | 2 | 193 | 2 | 0 |
| G28881A | G>A | R>K | N | N | 20.54% | 4 | 3 | 3 | 1 | 1 | 3453 |
| G28882A | G>A | R>K ¹ | N | N | 20.49% | 1 | 2 | 0 | 0 | 0 | 3454 |
| G28883C | G>C | G>R | N | N | 20.49% | 1 | 2 | 1 | 0 | 0 | 3453 |
| A29700G | A>G | Intron | Intron | Intron | 1.04% | 0 | 0 | 4 | 167 | 4 | 1 |
¹ G28881A and G28882A occur within the same codon; the amino acid annotation (R>K) is based on the co-occurrence of these mutations. * Under positive selection, as inferred by HyPhy.
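
The Frequency column can be reproduced from the per-cluster counts: it is the number of carriers summed over the six clusters, divided by the 16,873 genomes. The sketch below checks this for a few rows copied from the table. It also computes a one-way ANOVA F-statistic for a 0/1 "carries the mutation" indicator grouped by cluster, to illustrate how such a statistic separates cluster-specific mutations from evenly spread ones. The F-statistic part is an assumption made for illustration: the record only says that ANOVA was used and does not describe how it was applied. Cluster sizes are taken from the totals of the geographic distribution table; all function names are hypothetical.

```python
# Cluster sizes A-F (totals from the geographic distribution table).
CLUSTER_SIZES = [1388, 2142, 4334, 1910, 3645, 3454]

# Carrier counts per cluster A-F for a few mutations (copied from the table).
CARRIERS = {
    "C241T": [10, 10, 4238, 2, 3548, 3391],
    "C1059T": [1, 8, 2, 0, 3645, 3],
    "A23403G": [4, 4, 4316, 6, 3634, 3451],
    "G28882A": [1, 2, 0, 0, 0, 3454],
}


def overall_frequency(carriers, sizes):
    """Fraction of all genomes that carry the mutation."""
    return sum(carriers) / sum(sizes)


def anova_f(carriers, sizes):
    """One-way ANOVA F-statistic for a binary carrier indicator across groups.

    Assumed illustration only; not necessarily the authors' procedure.
    """
    n_total = sum(sizes)
    p_all = sum(carriers) / n_total
    # Between-group sum of squares: group means versus the grand mean.
    ss_between = sum(n * (k / n - p_all) ** 2 for k, n in zip(carriers, sizes))
    # Within-group sum of squares for 0/1 data: n * p * (1 - p) = k * (1 - k / n).
    ss_within = sum(k * (1 - k / n) for k, n in zip(carriers, sizes))
    df_between = len(sizes) - 1
    df_within = n_total - len(sizes)
    if ss_within == 0:
        return float("inf")  # groups are perfectly separated
    return (ss_between / df_between) / (ss_within / df_within)


frequencies = {m: overall_frequency(k, CLUSTER_SIZES) for m, k in CARRIERS.items()}
f_stats = {m: anova_f(k, CLUSTER_SIZES) for m, k in CARRIERS.items()}

for name in CARRIERS:
    print(f"{name}: frequency={frequencies[name]:.2%}, F={f_stats[name]:.1f}")
```

A mutation carried at the same rate in every cluster gives an F-statistic of zero, while one confined to a single cluster, such as C1059T in cluster E, gives a very large value.
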