Srinivasulu Yerukala Sathipati, Shinn-Ying Ho.
Abstract
There is an urgent need to elucidate the underlying mechanisms of coronavirus disease (COVID-19) so that vaccines and treatments can be devised. Severe acute respiratory syndrome coronavirus 2 is genetically similar to bat and pangolin viruses, but a comprehensive understanding of the functions of its proteins at the amino acid sequence level is lacking. A total of 4320 sequences of human and nonhuman coronaviruses were retrieved from the Global Initiative on Sharing All Influenza Data and the National Center for Biotechnology Information. This work proposes an optimization method, COVID-Pred, with an efficient feature selection algorithm to classify species-specific coronaviruses based on the physicochemical properties (PCPs) of their sequences. COVID-Pred identified a set of 11 PCPs using a support vector machine and achieved 10-fold cross-validation and test accuracies of 99.53% and 97.80%, respectively. These findings could provide key insights into the driving forces during the course of infection and assist in developing effective therapies.
Keywords: SARS-CoV-2 classification; machine learning; physicochemical properties; support vector machines
Year: 2021 PMID: 33856796 PMCID: PMC8056951 DOI: 10.1021/acs.jproteome.1c00156
Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 4.466
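The abstract describes classifying sequences by their physicochemical properties but does not spell out how a property becomes a feature. The following is a minimal sketch, assuming each PCP is encoded as the average of a per-residue scale over the sequence; that encoding is an assumption, not something the abstract confirms. The function names `pcp_feature` and `encode` are hypothetical, and the molecular-weight table holds standard residue masses used as a stand-in for the AAindex entry FASG760101, so it should be checked against AAindex before use.

```python
# Standard amino acid molecular weights in daltons. Used here as a stand-in
# for the AAindex scale FASG760101 (assumption: verify against AAindex).
MOLECULAR_WEIGHT = {
    "A": 89.09, "R": 174.20, "N": 132.12, "D": 133.10, "C": 121.15,
    "Q": 146.15, "E": 147.13, "G": 75.07, "H": 155.16, "I": 131.17,
    "L": 131.17, "K": 146.19, "M": 149.21, "F": 165.19, "P": 115.13,
    "S": 105.09, "T": 119.12, "W": 204.24, "Y": 181.19, "V": 117.15,
}


def pcp_feature(sequence, scale):
    """Average a per-residue property over a sequence.

    Residues missing from the scale (for example X or gap characters)
    are skipped rather than treated as zero.
    """
    values = [scale[aa] for aa in sequence.upper() if aa in scale]
    if not values:
        raise ValueError("sequence contains no residues covered by the scale")
    return sum(values) / len(values)


def encode(sequence, scales):
    """Build one feature vector: one averaged value per property scale."""
    return [pcp_feature(sequence, scale) for scale in scales]


if __name__ == "__main__":
    # Hypothetical fragment, for illustration only.
    fragment = "MFVFLVLLPLVSSQ"
    print(encode(fragment, [MOLECULAR_WEIGHT]))
```

With 11 selected scales, `encode` would return an 11-element vector per sequence, which is the kind of input a support vector machine classifier expects.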
Performance Comparisons of COVID-Pred
| method | 10-CV (%) | MCC | SN | SP | AUC |
|---|---|---|---|---|---|
| Naive Bayes | 89.80 | 0.80 | 0.98 | 0.84 | 0.96 |
| MLP | 92.10 | 0.84 | 0.96 | 0.88 | 0.96 |
| SMO | 88.15 | 0.77 | 0.98 | 0.82 | 0.87 |
| SGD | 91.77 | 0.84 | 0.98 | 0.87 | 0.91 |
| LMT | 90.78 | 0.81 | 0.93 | 0.88 | 0.95 |
| J48 | 92.43 | 0.84 | 0.93 | 0.91 | 0.93 |
| decision tree | 83.55 | 0.69 | 0.96 | 0.76 | 0.82 |
| random forest | 96.38 | 0.92 | 0.97 | 0.95 | 0.98 |
| COVID-Pred | 99.67 | 0.99 | 1.00 | 0.99 | 0.99 |
| COVID-Pred (mean) | 99.38 ± 0.11 | 0.98 ± 0.003 | 0.99 ± 0.003 | 0.99 ± 0.001 | 0.99 ± 0.001 |
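For readers unfamiliar with the column abbreviations, the sketch below computes accuracy, sensitivity (SN), specificity (SP), and the Matthews correlation coefficient (MCC) from a binary confusion matrix using their standard definitions. The function name `binary_metrics` and the counts in the example are hypothetical and are not taken from the study; AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return accuracy, SN, SP, and MCC from confusion-matrix counts.

    tp/fn count the positive class (correctly / incorrectly classified),
    tn/fp count the negative class.
    """
    total = tp + fn + tn + fp
    if total == 0:
        raise ValueError("confusion matrix is empty")
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal sum is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": accuracy, "SN": sensitivity, "SP": specificity, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in binary_metrics(tp=90, fn=10, tn=80, fp=20).items():
        print(f"{name}: {value:.3f}")
```

MCC uses all four cells of the confusion matrix, so it stays informative when the two classes differ in size, which is one reason it is commonly reported next to accuracy.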
MED Analysis
| rank | AAindex-ID | AAindex-desc | MED |
|---|---|---|---|
| 1 | FAUJ880103 | normalized van der Waals volume | 9.94 |
| 2 | ONEK900101 | delta G values for the peptides extrapolated to 0 M urea | 9.33 |
| 3 | PALJ810116 | normalized frequency of turn in α/β class | 8.05 |
| 4 | AURR980102 | normalized positional residue frequency at the helix termini N″′ | 6.83 |
| 5 | FAUJ880106 | STERIMOL maximum width of the side chain | 6.56 |
| 6 | TANS770103 | normalized frequency of the extended structure | 6.56 |
| 7 | FASG760101 | molecular weight | 5.68 |
| 8 | MONM990101 | turn propensity scale for transmembrane helices | 4.33 |
| 9 | AURR980116 | normalized positional residue frequency at the helix termini Cc | 3.80 |
| 10 | DAYM780201 | relative mutability | 1.91 |
| 11 | RICJ880117 | relative preference value at C″ | 0.50 |
Figure 1. Comparison of PCPs between the HCoV and nHCoV proteins. (A) FAUJ880103, (B) ONEK900101, (C) PALJ810116, (D) AURR980102, (E) FAUJ880106, (F) TANS770103, (G) FASG760101, (H) MONM990101, (I) AURR980116, (J) DAYM780201, and (K) RICJ880117.
Figure 2. Graphical representation of the analyzed informative PCPs using the secondary structure of 6M0J as a model.
Figure 3. Visualization of the S glycoprotein with mutations. (A) Structure of the SARS-CoV S protein (PDB: 6acc, EM 3.6 Å). (B) S glycoprotein (PDB: 6acj, EM 4.2 Å) in complex with the host cell receptor ACE2 (green ribbon); mutations identified in the query sequences are shown as colored balls (based on the nearest residue if in the loop/termini region).
Figure 4. Normalized amino acid compositional preferences showing differences in the 11 PCPs between HCoV and nHCoV.
Figure 5. Amino acid and dipeptide compositional analysis. (A) Amino acid compositional differences between HCoV and nHCoV and (B) heatmap showing dipeptide compositional differences between HCoV and nHCoV.
AAC Difference between the HCoV and nHCoV Proteins
| amino acids | HCoV | nHCoV | composition difference |
|---|---|---|---|
| L | 11% | 9% | 2% |
| K | 5% | 4% | 2% |
| E | 5% | 3% | 1% |
| H | 2% | 1% | 1% |
| R | 4% | 3% | 1% |
| M | 2% | 2% | 1% |
| W | 1% | 1% | 0% |
| P | 4% | 4% | 0% |
| D | 5% | 5% | 0% |
| C | 3% | 3% | 0% |
| F | 5% | 5% | 0% |
| I | 6% | 6% | 0% |
| A | 6% | 6% | 0% |
| Y | 4% | 5% | –1% |
| Q | 4% | 5% | –1% |
| G | 6% | 7% | –1% |
| V | 7% | 8% | –1% |
| S | 7% | 8% | –1% |
| T | 7% | 8% | –1% |
| N | 5% | 7% | –2% |
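A composition table of this kind can be reproduced with a short script. The sketch below is a minimal illustration, assuming composition is computed by pooling residues across all sequences in a group; the study may instead average per-sequence compositions. The names `aac` and `aac_difference` and the toy sequences are hypothetical. Rows such as K (5%, 4%, 2%) suggest that the differences were computed from unrounded percentages and rounded afterwards, which is what this sketch does as well.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequences):
    """Amino acid composition in percent, pooled over a group of sequences.

    Characters outside the 20 standard residues are ignored.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def aac_difference(group_a, group_b):
    """Per-residue composition of group_a minus that of group_b."""
    comp_a, comp_b = aac(group_a), aac(group_b)
    return {aa: comp_a[aa] - comp_b[aa] for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Toy sequences standing in for the HCoV and nHCoV groups.
    hcov = ["MKLLEK", "LLKRE"]
    nhcov = ["MNSTVG", "NQYTS"]
    diff = aac_difference(hcov, nhcov)
    # Print residues from most enriched to most depleted in the first group.
    for aa, delta in sorted(diff.items(), key=lambda kv: -kv[1]):
        print(f"{aa}\t{delta:+.0f}%")
```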