Xiao-Li Qiang1, Peng Xu1, Gang Fang1, Wen-Bin Liu1, Zheng Kou2.
Abstract
BACKGROUND: Coronaviruses can cross the species barrier and cause severe respiratory syndrome in humans. SARS-CoV-2, which may have originated in bats, is still circulating in China. In this study, a prediction model is proposed to evaluate the infection risk of coronaviruses of non-human origin for early warning.
Keywords: Coronavirus; Cross-species infection; Machine learning; Spike protein
Year: 2020 PMID: 32209118 PMCID: PMC7093988 DOI: 10.1186/s40249-020-00649-8
Source DB: PubMed Journal: Infect Dis Poverty ISSN: 2049-9957 Impact factor: 4.520
Summary of feature descriptors
| Feature | Type | Dimension | Feature | Type | Dimension |
|---|---|---|---|---|---|
| 1 | PseAAC (λ = 1) | 21 | 22 | GGAP (g = 0) | 400 |
| 2 | PseAAC (λ = 2) | 22 | 23 | GGAP (g = 1) | 400 |
| 3 | PseAAC (λ = 3) | 23 | 24 | GGAP (g = 2) | 400 |
| 4 | PseAAC (λ = 4) | 24 | 25 | GGAP (g = 3) | 400 |
| 5 | PseAAC (λ = 5) | 25 | 26 | GGAP (g = 4) | 400 |
| 6 | PseAAC (λ = 6) | 26 | 27 | GGAP (g = 5) | 400 |
| 7 | PseAAC (λ = 7) | 27 | 28 | GGAP (g = 6) | 400 |
| 8 | PseAAC (λ = 8) | 28 | 29 | GGAP (g = 7) | 400 |
| 9 | PseAAC (λ = 9) | 29 | 30 | GGAP (g = 8) | 400 |
| 10 | PseAAC (λ = 10) | 30 | 31 | GGAP (g = 9) | 400 |
| 11 | PseAAC (λ = 11) | 31 | 32 | GGAP (g = 10) | 400 |
| 12 | PseAAC (λ = 12) | 32 | 33 | GGAP (g = 11) | 400 |
| 13 | PseAAC (λ = 13) | 33 | 34 | GGAP (g = 12) | 400 |
| 14 | PseAAC (λ = 14) | 34 | 35 | GGAP (g = 13) | 400 |
| 15 | PseAAC (λ = 15) | 35 | 36 | GGAP (g = 14) | 400 |
| 16 | PseAAC (λ = 16) | 36 | 37 | GGAP (g = 15) | 400 |
| 17 | PseAAC (λ = 17) | 37 | 38 | GGAP (g = 16) | 400 |
| 18 | PseAAC (λ = 18) | 38 | 39 | GGAP (g = 17) | 400 |
| 19 | PseAAC (λ = 19) | 39 | 40 | GGAP (g = 18) | 400 |
| 20 | PseAAC (λ = 20) | 40 | 41 | GGAP (g = 19) | 400 |
| 21 | AAC | 20 | | | |
GGAP G-gap dipeptide composition, PseAAC Pseudo-amino-acid composition, AAC Amino acid composition
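The dimensions in the table follow from how the descriptors are defined: AAC has one entry per standard residue (20), and GGAP has one entry per ordered residue pair (20 × 20 = 400) regardless of the gap g. The following is a minimal sketch of these two descriptors, not the authors' implementation. The function names, the skipping of non-standard residues, and the normalization by the number of valid pairs are assumptions made for illustration. PseAAC is omitted because it depends on physicochemical property values that are not given here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aac(seq: str) -> list[float]:
    """Amino acid composition: relative frequency of each standard residue (20 values)."""
    # Assumption: residues outside the 20 standard letters (e.g. 'X') are ignored.
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def ggap(seq: str, g: int) -> list[float]:
    """g-gap dipeptide composition: frequency of residue pairs with g residues between them."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    # Pair position i with position i + g + 1, so g = 0 gives adjacent dipeptides.
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:  # skip pairs containing a non-standard residue
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[p] / total for p in DIPEPTIDES]
```

Under these assumptions, `ggap(spike_sequence, 3)` would give a 400-dimensional vector of the same shape as the GGAP (g = 3) feature in the table, where `spike_sequence` is a hypothetical protein string.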
Fig. 1 Schematic framework of machine learning. First, feature representations from three feature descriptors are obtained. Second, the RF method is used to train and test the dataset and make predictions for cross-species transmission of coronavirus. NGDC: National Genomics Data Center; AAC: Amino acid composition; PC-PseAAC: Parallel correlation-based pseudo-amino-acid composition; GGAP: G-gap dipeptide composition; RF: Random forest
Results of feature representations
| Feature | ACC | SN | SP | MCC | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|
| GGAP (g = 3) | 98.18 | 99.16 | 97.26 | 0.9638 | 2011 | 2100 | 59 | 17 |
| PC-PseAAC (λ = 2) | 96.36 | 98.61 | 94.25 | 0.9284 | 2000 | 2035 | 124 | 28 |
| AAC | 96.15 | 98.61 | 93.83 | 0.9243 | 2000 | 2026 | 133 | 28 |
ACC Accuracy, SN Sensitivity, SP Specificity, MCC Matthews correlation coefficient, TP True positive, TN True negative, FP False positive, FN False negative, GGAP G-gap dipeptide composition, PC-PseAAC Parallel correlation-based pseudo-amino-acid composition, AAC Amino acid composition
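The four summary metrics in the table can be derived from the confusion-matrix counts in the same row. The sketch below uses the standard textbook definitions of accuracy, sensitivity, specificity, and the Matthews correlation coefficient. The function name is hypothetical, and the code is an illustration rather than the authors' evaluation script.

```python
import math


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Compute ACC, SN, SP (in percent) and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = 100.0 * (tp + tn) / total  # fraction of all samples classified correctly
    sn = 100.0 * tp / (tp + fn)      # sensitivity: recall on the positive class
    sp = 100.0 * tn / (tn + fp)      # specificity: recall on the negative class
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


# Counts taken from the GGAP (g = 3) row of the table.
ggap_metrics = confusion_metrics(tp=2011, tn=2100, fp=59, fn=17)
```

Applied to the GGAP (g = 3) row, these formulas agree with the reported ACC, SN, SP, and MCC to within one unit in the last reported decimal place.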
Fig. 2 Predictive performance of feature representations. a Ten-fold cross-validation results. b Receiver operating characteristic curves generated by plotting the true positive rate (TPR) against the false positive rate (FPR) under different classification thresholds. ACC: Accuracy; SN: Sensitivity; SP: Specificity; MCC: Matthews correlation coefficient; AAC: Amino acid composition; GGAP: G-gap dipeptide composition; PC-PseAAC: Parallel correlation-based pseudo-amino-acid composition
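The construction of a receiver operating characteristic curve described in the caption of Fig. 2 can be illustrated with a short sketch: samples are ranked by classifier score, the threshold is lowered one distinct score at a time, and the resulting (FPR, TPR) pair is recorded at each step. The function names and the trapezoidal area calculation are assumptions chosen for illustration, and the labels and scores in any call would come from a classifier such as the random forest, which is not reproduced here.

```python
def roc_points(labels: list[int], scores: list[float]) -> list[tuple[float, float]]:
    """Return (FPR, TPR) points obtained by sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points: list[tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A perfectly separating classifier gives an area of 1.0 under this construction, while a score that carries no information gives an area close to 0.5.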
Fig. 3 Patterns of human coronavirus clustered using the multidimensional scaling method. The x and y coordinates denote the first main factor and second main factor, respectively. SARS-CoV-2 is indicated by the blue solid circle
Fig. 4 Evolutionary dynamic of SARS-CoV-2 and SARS-CoV. a Euclidean distance between SARS-CoV-2 and other coronaviruses in the dataset. b Euclidean distance between SARS-CoV and other coronaviruses in the dataset. The x and y coordinates denote the strain number and Euclidean distance based on the GGAP (g = 3) feature, respectively
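The distance plotted in Fig. 4 is the ordinary Euclidean distance between two feature vectors. The sketch below shows one way to compute the distance from a reference strain to every other strain and to order the strains by it. The function name is hypothetical, and the vectors passed in would be GGAP (g = 3) feature vectors in practice, although any numeric vectors of equal length work.

```python
import math


def distances_to_reference(
    reference: list[float], others: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Euclidean distance from a reference feature vector to each named vector.

    Returns (name, distance) pairs sorted from nearest to farthest.
    """
    result = [(name, math.dist(reference, vec)) for name, vec in others.items()]
    return sorted(result, key=lambda item: item[1])
```

Sorting by distance is a presentation choice made here; the figure itself plots distance against strain number.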