Kiril Kuzmin, Ayotomiwa Ezekiel Adeniyi, Arthur Kevin DaSouza, Deuk Lim, Huyen Nguyen, Nuria Ramirez Molina, Lanqiao Xiong, Irene T Weber, Robert W Harrison.
Abstract
Coronaviruses infect many animals, including humans, due to interspecies transmission. Three of the known human coronaviruses, MERS, SARS-CoV-1, and SARS-CoV-2 (the pathogen behind the COVID-19 pandemic), cause severe disease. Improved methods to predict host specificity of coronaviruses will be valuable for identifying and controlling future outbreaks. The coronavirus S protein plays a key role in host specificity by attaching the virus to receptors on the cell membrane. We analyzed 1238 spike sequences for their host specificity. Spike sequences readily segregate in t-SNE embeddings into clusters of similar hosts and/or virus species. Machine learning with SVM, Logistic Regression, Decision Tree, and Random Forest gave high average accuracies, F1 scores, sensitivities, and specificities of 0.95-0.99. Importantly, sites identified by Decision Tree correspond to protein regions with known biological importance. These results demonstrate that spike sequences alone can be used to predict host specificity.
Keywords: Machine learning; Sequence clustering; Spike protein; Viral host specificity; coronaviruses; t-SNE
Year: 2020 PMID: 32981683 PMCID: PMC7500881 DOI: 10.1016/j.bbrc.2020.09.010
Source DB: PubMed Journal: Biochem Biophys Res Commun ISSN: 0006-291X Impact factor: 3.575
Fig. 1 The distribution of the cleaned dataset (1238 sequences). a The top 6 genera of CoVs: avian CoV, porcine epidemic diarrhea virus (PEDV), betacoronavirus 1 (BCoV-1), MERS, porcine CoV HKU15, SARS-CoV-2. b The top 5 CoV hosts: swine, avians, humans, bats, and camels.
Fig. 2 Explained variance vs number of components for (i) all 1238 sequences and (ii) the sequences whose hosts are either avian or swine (667 entries). The 0.9 threshold is and for (i) and (ii) respectively. The 0.95 threshold is and for (i) and (ii) respectively. We chose 50 components for SVD, which captured 94.20% and 96.28% of the variance for (i) and (ii), respectively.
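The component counts in Fig. 2 are read off where the cumulative explained variance crosses a threshold. A minimal sketch of that bookkeeping, assuming the variance captured by each component is proportional to its squared singular value (which holds for centered data). The function name and the toy singular values are hypothetical and are not taken from the paper:

```python
from itertools import accumulate


def components_for_threshold(singular_values, threshold):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `threshold` (for example 0.90 or 0.95).

    Assumption: variance per component is proportional to the squared
    singular value, which holds when the data matrix has been centered.
    """
    variances = [s * s for s in singular_values]
    total = sum(variances)
    if total == 0:
        raise ValueError("all singular values are zero")
    for count, cumulative in enumerate(accumulate(variances), start=1):
        if cumulative / total >= threshold:
            return count
    return len(variances)  # guards against floating-point round-off near 1.0


# Toy spectrum for illustration only (not the spike-sequence data).
toy_singular_values = [4, 3, 2, 1]
for threshold in (0.90, 0.95):
    print(threshold, components_for_threshold(toy_singular_values, threshold))
```

With real data, the singular values would come from the SVD of the encoded sequence matrix, and the same function could be applied to both the full set and the avian/swine subset.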
Fig. 3 t-SNE embeddings of all sequences (a–e), and human related CoV sequences only (f). a Human related CoVs vs the other CoVs. b The most represented virus species in the dataset. c Major types of hosts in the dataset. d Relative locations of different genera of CoVs. e Human related CoVs embedded with all other CoV sequences (1238 entries in total). f Embeddings of human related CoVs only (463 entries in total).
3-fold cross-validation of different classifiers for . The results are presented as mean (μ) ± standard deviation (σ) of 4 measures of performance: accuracy (Ac), F1-score (F1), sensitivity (Sn), and specificity (Sp). The best performances (with the greatest value ) are shown in bold.
| SVM | LR | DT | RF | |
|---|---|---|---|---|
| Inputs with 59,900 components | ||||
| Ac | .983 | .985 | .984 | |
| F1 | .978 | .980 | .979 | |
| Sn | .981 | .987 | .985 | |
| Sp | .985 | .983 | .983 | |
| Inputs with 50 components | ||||
| Ac | .974 | .969 | .977 | |
| F1 | .966 | .960 | .969 | |
| Sn | .970 | .968 | .961 | |
| Sp | .977 | .970 | .987 | |
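The four measures reported in the table can be computed from per-fold confusion counts. A minimal sketch, assuming a binary classification task and the sample standard deviation (the text does not say which estimator was used). The function names and the counts are hypothetical:

```python
from statistics import mean, stdev


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, F1, sensitivity and specificity from confusion counts.

    Assumes every denominator is non-zero (both classes present and at
    least one positive prediction).
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"Ac": accuracy, "F1": f1, "Sn": sensitivity, "Sp": specificity}


def summarize(folds):
    """Mean and sample standard deviation of each metric across folds."""
    per_fold = [binary_metrics(*counts) for counts in folds]
    return {
        name: (mean(m[name] for m in per_fold), stdev(m[name] for m in per_fold))
        for name in per_fold[0]
    }


# Hypothetical (tp, fp, tn, fn) counts for three folds, for illustration only.
example_folds = [(150, 4, 255, 3), (148, 6, 254, 5), (151, 3, 256, 2)]
for name, (mu, sigma) in summarize(example_folds).items():
    print(f"{name}: {mu:.3f} ± {sigma:.3f}")
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are unbalanced, since accuracy alone can hide poor performance on the smaller class.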
k-fold cross-validations for performed by DT and SVM classifiers run on non-reduced and SVD-reduced inputs, respectively. The results are presented as mean (μ) ± standard deviation (σ) of 4 measures of performance: accuracy (Ac), F1-score (F1), sensitivity (Sn), and specificity (Sp). The best performances (with the greatest value ) are shown in bold.
| 2-fold | 3-fold | 5-fold | 10-fold | |
|---|---|---|---|---|
| DT, inputs with 59,900 components | ||||
| Ac | .983 | .973 | .974 | |
| F1 | .978 | .967 | .970 | |
| Sn | .991 | .989 | .994 | |
| Sp | .978 | .964 | .963 | |
| SVM, inputs with 50 components | ||||
| Ac | .972 | .986 | .989 | |
| F1 | .962 | .981 | .986 | |
| Sn | .950 | .989 | .998 | |
| Sp | .985 | .983 | .983 | |
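The fold counts compared in the table (2, 3, 5, and 10) change how much data each model trains on and how large each test fold is. A minimal sketch of a shuffled k-fold split, assuming a plain (non-stratified) partition; the text does not state which splitter was used, and the sample count of 1238 is used here only for illustration:

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold split.

    Each sample lands in exactly one test fold, and fold sizes differ
    by at most one.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Test-fold sizes for each fold count, with n = 1238 as an illustrative size.
for k in (2, 3, 5, 10):
    sizes = [len(test) for _, test in k_fold_indices(1238, k)]
    print(k, sizes)
```

Larger k leaves more sequences for training in each run but makes each test fold smaller, which is one reason the mean and standard deviation can shift as k changes.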
3-fold cross-validation of DT and SVM classifiers run on inputs without and with dimensionality reduction, respectively. The results are presented as mean (μ) ± standard deviation (σ) of 4 measures of performance: accuracy (Ac), F1-score (F1), sensitivity (Sn), and specificity (Sp).
| DT, inputs with 59,900 components | |||
| Ac | .986 | .977 | .978 |
| F1 | .982 | .974 | .986 |
| Sn | .996 | .957 | .984 |
| Sp | .980 | .995 | .960 |
| SVM, inputs with 50 components | |||
| Ac | .986 | .976 | .987 |
| F1 | .981 | .972 | .992 |
| Sn | .989 | .957 | 1.000 |
| Sp | .983 | .992 | .947 |
Table S3 demonstrates decent results for all performed classifications (referring to the 4 statistical metrics used: accuracy, F1-score, sensitivity, and specificity), reaching more than 98% for , 95% for , and 94% for .
Important sites
We used DT to identify important sites in classification (see Table S4). Only two sites (1483 and 2258) had importance greater than 0.80. Remarkably, one of the two appeared in every run of the DT classifier, regardless of the number of splits k. All other sites used in DT had importance below 0.13. As k increases, the proportion of occurrences of the two sites shifts in favor of 2258, reaching 100% in the 10-fold split.
The average importance and number of occurrences (in parentheses) for deciding sites identified by the DT classifier in . The classifier was run 10 × k times, where k is the number of folds.
| Site | 2-fold | 3-fold | 5-fold | 10-fold |
|---|---|---|---|---|
| 1483 | .811 (3) | .803 (2) | .800 (1) | NA (0) |
| 2258 | .807 (17) | .806 (28) | .806 (49) | .806 (100) |
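A table of this kind can be assembled by recording, for each fitted tree, which site carries the highest importance and then averaging per site. A minimal sketch, assuming "occurrences" counts the runs in which a site is the top-ranked feature (an interpretation, not something the text confirms). The function name and all importance values are hypothetical:

```python
from collections import defaultdict


def tally_top_sites(runs):
    """Summarize the highest-importance site of each run.

    `runs` is a list of dicts mapping site index -> importance for one
    fitted tree. Returns {site: (average importance, number of runs in
    which that site ranked first)}.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for importances in runs:
        site = max(importances, key=importances.get)
        totals[site] += importances[site]
        counts[site] += 1
    return {site: (totals[site] / counts[site], counts[site]) for site in counts}


# Made-up importances for four runs, for illustration only.
example_runs = [
    {1483: 0.81, 2258: 0.02, 77: 0.05},
    {2258: 0.80, 1483: 0.03},
    {2258: 0.82, 912: 0.10},
    {2258: 0.81, 1483: 0.01},
]
for site, (avg, n) in sorted(tally_top_sites(example_runs).items()):
    print(f"{site}: {avg:.3f} ({n})")
```

In practice, the per-run dictionaries would be filled from the feature importances of each trained decision tree, one tree per fold and repetition.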