Puneet Rawat, Divya Sharma, Medha Pandey, R Prabakaran, M Michael Gromiha.
Abstract
The prolonged transmission of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in the human population has led to demographic divergence and the emergence of several location-specific clusters of viral strains. Although the effect of mutations on the severity and survival of the virus is still unclear, it is evident that certain sites in the viral proteome are more prone to mutation than others. Millions of SARS-CoV-2 sequences collected all over the world provide a unique opportunity to understand viral protein mutations and to develop novel computational approaches for predicting mutational patterns. In this study, we classified mutation sites into low and high mutability classes based on the number of viral isolates containing a mutation at each site. Analysis of the physicochemical and structural features of the SARS-CoV-2 proteins showed that residue type, surface accessibility, residue bulkiness, stability and sequence conservation at the mutation site distinguish low from high mutability sites. We further developed machine learning models using these features to predict low and high mutability sites at different selection thresholds (the 5-30% most and least mutated sites) and observed that performance improves as the selection threshold is reduced (prediction accuracy ranging from 65% at the 30% threshold to 77% at the 5% threshold). The analysis will be useful for early detection of SARS-CoV-2 variants of concern, and the approach can also be applied to other existing and emerging viruses to help prevent future pandemics.
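The selection-threshold step described in the abstract can be illustrated with a minimal sketch. It assumes each mutation site is ranked by its isolate count and that the bottom and top X% of sites form the low and high mutability classes. The function name, the site labels and the tie handling are hypothetical; the study's class sizes are not exactly equal, so its handling of tied counts probably differs from this sketch.

```python
def select_mutability_classes(isolate_counts, threshold_pct):
    """Split sites into low/high mutability classes by isolate count.

    isolate_counts: dict mapping a site label to the number of isolates
                    carrying a mutation at that site (hypothetical input format).
    threshold_pct:  selection threshold in percent (the study uses 5 to 30).
    Returns (low_sites, high_sites), each holding threshold_pct% of all sites.
    """
    if not 0 < threshold_pct <= 50:
        raise ValueError("threshold_pct must be in (0, 50]")
    # Rank sites from least to most mutated; ties keep insertion order (assumption).
    ranked = sorted(isolate_counts, key=isolate_counts.get)
    k = max(1, round(len(ranked) * threshold_pct / 100))
    low_sites = ranked[:k]    # bottommost mutated sites
    high_sites = ranked[-k:]  # topmost mutated sites
    return low_sites, high_sites


if __name__ == "__main__":
    # Toy counts, not taken from the study.
    toy_counts = {f"site{i}": i * 10 for i in range(1, 21)}
    low, high = select_mutability_classes(toy_counts, 10)
    print("low:", low)
    print("high:", high)
```

Sites between the two cut-offs are left out of both classes, which is consistent with the dataset growing as the threshold increases.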
Keywords: COVID-19; Machine learning; Mutation; Protein mutability; SARS-CoV-2
Year: 2022 PMID: 35714506 PMCID: PMC9173821 DOI: 10.1016/j.compbiomed.2022.105708
Source DB: PubMed Journal: Comput Biol Med ISSN: 0010-4825 Impact factor: 6.698
Fig. 1. Workflow illustrating the steps followed in the current study.
Fig. 2. Histogram of the number of mutation sites against the number of isolates observed. Approximately 90% of the mutation sites have fewer than 1000 isolates containing a mutation, although the highest isolate count is 1,079,273.
Fig. 3. Amino acid frequency in the low and high mutability site classes.
Fig. 4. Major features in the categories of surface accessibility, residue bulkiness, stability of the mutation site and conservation of the mutation site (p-value < 10⁻¹¹).
Performance of the baseline model at 30% selection threshold.
| Evaluation | Accuracy (%) | Sensitivity (%) | Specificity (%) | ROC |
|---|---|---|---|---|
| Training dataset | 65 | 62.4 | 67.7 | 0.711 |
| Leave-one-out cross-validation | 60.03 | 57.9 | 62.2 | 0.648 |
| 10-fold cross-validation | 60.2 ± 0.34 | 57.6 ± 0.51 | 62.7 ± 0.46 | 0.646 ± 0.002 |
Average values are listed with their standard deviation over 100 iterations, with the data randomized before each iteration.
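The repeated, randomized cross-validation behind the "mean ± standard deviation" entries can be sketched as follows. This is a minimal illustration using only the Python standard library. The majority-class predictor is a placeholder and not the authors' model, and all function names are hypothetical.

```python
import random
import statistics


def k_fold_indices(n_samples, k, rng):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def repeated_cv_accuracy(labels, predict_fn, k=10, repeats=100, seed=0):
    """Mean and standard deviation of accuracy (%) over repeated k-fold CV.

    predict_fn(train_idx, test_idx) must return one predicted label per test index.
    Each repeat reshuffles the data and pools correct predictions across folds.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        folds = k_fold_indices(len(labels), k, rng)
        correct = 0
        for i, test_idx in enumerate(folds):
            train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
            predictions = predict_fn(train_idx, test_idx)
            correct += sum(p == labels[j] for p, j in zip(predictions, test_idx))
        accuracies.append(100.0 * correct / len(labels))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def make_majority_predictor(labels):
    """Placeholder model: always predicts the majority class of the training folds."""
    def predict(train_idx, test_idx):
        train_labels = [labels[j] for j in train_idx]
        majority = max(set(train_labels), key=train_labels.count)
        return [majority] * len(test_idx)
    return predict


if __name__ == "__main__":
    data_rng = random.Random(42)
    toy_labels = [data_rng.randint(0, 1) for _ in range(200)]  # toy data only
    mean_acc, sd_acc = repeated_cv_accuracy(toy_labels, make_majority_predictor(toy_labels))
    print(f"accuracy: {mean_acc:.1f} ± {sd_acc:.2f}")
```

In practice a library routine such as scikit-learn's repeated k-fold splitters would replace the hand-written fold logic. The sketch only shows where the randomization and the standard deviation come from.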
Performance of the baseline model at different selection thresholds.
| Selection threshold (%) | Total mutation sites | Low mutability sites | High mutability sites | Accuracy (%) | Sensitivity (%) | Specificity (%) | ROC |
|---|---|---|---|---|---|---|---|
| 5 | 864 | 430 | 434 | 76.7 | 76.5 | 77 | 0.84 |
| 10 | 1748 | 881 | 867 | 72.8 | 73 | 72.5 | 0.795 |
| 15 | 2589 | 1288 | 1301 | 69.9 | 68.3 | 71.5 | 0.761 |
| 20 | 3453 | 1718 | 1735 | 68.4 | 66.5 | 70.3 | 0.747 |
| 25 | 4357 | 2187 | 2170 | 66.8 | 66.1 | 67.5 | 0.73 |
| 30 | 5204 | 2600 | 2604 | 65 | 62.4 | 67.7 | 0.711 |
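As a consistency check on the table, accuracy can be recomputed as the class-size-weighted mean of sensitivity and specificity. The sketch below assumes that high mutability sites are the positive class. The source does not state this, but the classes are nearly balanced, so the result barely depends on the choice.

```python
def accuracy_from_rates(sensitivity_pct, specificity_pct, n_positive, n_negative):
    """Accuracy (%) implied by sensitivity, specificity and the class sizes."""
    correct = (sensitivity_pct / 100) * n_positive + (specificity_pct / 100) * n_negative
    return 100 * correct / (n_positive + n_negative)


# (threshold %, low sites, high sites, reported accuracy, sensitivity, specificity)
# copied from the table above.
ROWS = [
    (5, 430, 434, 76.7, 76.5, 77.0),
    (10, 881, 867, 72.8, 73.0, 72.5),
    (15, 1288, 1301, 69.9, 68.3, 71.5),
    (20, 1718, 1735, 68.4, 66.5, 70.3),
    (25, 2187, 2170, 66.8, 66.1, 67.5),
    (30, 2600, 2604, 65.0, 62.4, 67.7),
]

if __name__ == "__main__":
    for threshold, n_low, n_high, reported, sens, spec in ROWS:
        # Assumption: high mutability is the positive class.
        implied = accuracy_from_rates(sens, spec, n_positive=n_high, n_negative=n_low)
        print(f"threshold {threshold:>2}%: reported {reported:.1f}, implied {implied:.2f}")
```

For every row the implied accuracy agrees with the reported value to within 0.1 percentage points, so the three measures are mutually consistent.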
Features related to mutation probability, analyzed for the mutations of concern and mutations of interest.
| Mutation sites of concern/interest | Surface accessibility | Residue bulkiness | Stability of the mutation site | Conservation of the mutation site |
|---|---|---|---|---|
| | 1.05 | 68.7 | 3.46 | −0.4 |
| | | 82.97 | 3.92 | −0.65 |
| | 0.64 | 81.13 | 2.94 | −0.25 |
| | 0.39 | 68.77 | 4.36 | −0.45 |
| | | | | −0.25 |
| | 0.96 | 54.13 | 3.82 | −0.35 |
| | 0.63 | | | −0.55 |
| | 0.41 | 61.07 | 3.1 | −0.3 |
| | 0.61 | 79.2 | 4.6 | −0.85 |
| | 0.3 | 82.98 | 5.16 | −0.94 |
Note: The average values are calculated from the mutation sites considered in the current study of the SARS-CoV-2 proteome. These mutations of concern/interest are expected to be present at the high mutability sites. The features that do not follow the trend observed in the study are highlighted.
The list of mutations was obtained from https://outbreak.info/.
Mutation of concern (MOC): S:E484K.
Mutation of interest (MOI): S:L18F; S:K417N; S:K417T; S:N439K; S:L452R; S:S477N; S:S494P; S:N501Y; S:P681H; S:P681R.
The mutation sites of variants of concern and interest analyzed for four physicochemical features.
| Variant of concern | Mutation sites | Sites with 4 features satisfying criteria | 3 features | 2 features | 1 feature | 0 features |
|---|---|---|---|---|---|---|
| Delta (B.1.617.2) | 24 | 11 (45.8%) | 7 (29.2%) | 3 (12.5%) | 2 (8.3%) | 1 (4.2%) |
| Alpha (B.1.1.7) | 19 | 10 (52.6%) | 4 (21.1%) | 5 (26.3%) | 0 (0%) | 0 (0%) |
| Beta (B.1.351) | 16 | 8 (50%) | 4 (25%) | 2 (12.5%) | 2 (12.5%) | 0 (0%) |
| Gamma (P.1) | 22 | 13 (59.1%) | 5 (22.7%) | 3 (13.6%) | 1 (4.5%) | 0 (0%) |
| Lambda (C.37) | 19 | 7 (36.8%) | 5 (26.3%) | 4 (21.1%) | 3 (15.8%) | 0 (0%) |
| Mu (B.1.621) | 20 | 11 (55%) | 2 (10%) | 6 (30%) | 1 (5%) | 0 (0%) |
According to the study, a feature satisfies the criteria as follows:
1. High mutability sites are likely to have higher-than-average values for surface accessibility and conservation, and vice versa.
2. High mutability sites are likely to have lower-than-average values for residue bulkiness and stability, and vice versa.
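The two criteria above can be applied to a single site with a small sketch that counts how many of the four features point towards high mutability. The feature keys, the function name and all numeric values are hypothetical and are not taken from the study's tables.

```python
# Features expected to be above average at high mutability sites (criterion 1).
HIGHER_THAN_AVERAGE = ("surface_accessibility", "conservation")
# Features expected to be below average at high mutability sites (criterion 2).
LOWER_THAN_AVERAGE = ("residue_bulkiness", "stability")


def count_satisfied_features(site, averages):
    """Count how many of the four features satisfy the high mutability criteria.

    site and averages are dicts keyed by the feature names listed above.
    Returns an integer from 0 to 4.
    """
    satisfied = 0
    for feature in HIGHER_THAN_AVERAGE:
        if site[feature] > averages[feature]:
            satisfied += 1
    for feature in LOWER_THAN_AVERAGE:
        if site[feature] < averages[feature]:
            satisfied += 1
    return satisfied


if __name__ == "__main__":
    # Illustrative numbers only (assumption).
    toy_averages = {
        "surface_accessibility": 0.5,
        "residue_bulkiness": 70.0,
        "stability": 4.0,
        "conservation": -0.5,
    }
    toy_site = {
        "surface_accessibility": 0.8,
        "residue_bulkiness": 60.0,
        "stability": 3.0,
        "conservation": -0.3,
    }
    print(count_satisfied_features(toy_site, toy_averages))
```

Tallying this count over all mutation sites of a variant gives the kind of distribution shown in the table of variants of concern and interest.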