Syed Nisar Hussain Bukhari1, Amit Jain1, Ehtishamul Haq2, Abolfazl Mehbodniya3, Julian Webber4.
Abstract
The ongoing outbreak of coronavirus disease 2019 (COVID-19), caused by the single-stranded RNA virus severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has become a worldwide pandemic that continues to date. Vaccination has proven to be by far the most effective measure against COVID-19 and for combating the outbreak. Among vaccine types, epitope-based peptide vaccines have received comparatively little attention and hold large untapped potential for improving vaccine safety and immunogenicity. The peptides used in such vaccines are chemically synthesized from the amino acid sequences of antigenic proteins (T-cell epitopes) of the target pathogen. Identifying antigenic proteins through wet-lab experiments is difficult, expensive, and time-consuming. We propose an ensemble machine learning (ML) model that predicts T-cell epitopes (also known as immune-relevant determinants or antigenic determinants) of SARS-CoV-2 from the physicochemical properties of amino acids. To train the model, we retrieved experimentally determined SARS-CoV-2 T-cell epitopes from the Immune Epitope Database and Analysis Resource (IEDB). On a test set of SARS-CoV-2 peptides (T-cell epitopes and non-epitopes) obtained from IEDB, the model achieved an accuracy of 98.20%, an AUC (area under the ROC curve) of 0.991, a Gini of 0.994, a specificity of 0.971, a sensitivity of 0.982, an F-score of 0.990, and a precision of 0.981. Repeated 5-fold cross-validation gave an average accuracy of 97.98%. Comparison with five robust machine learning classifiers and with existing T-cell epitope prediction techniques, such as NetMHC and CTLpred, suggests that the proposed model performs better. The epitopes predicted by the model have a high probability of acting as peptide vaccine candidates, subject to in vitro and in vivo scientific assessment. The model should help the scientific community working on vaccine development save time when screening active T-cell epitope candidates of SARS-CoV-2 against inactive ones.
Keywords: COVID-19; SARS-CoV-2; T-cell epitope; ensemble learning; machine learning; peptide-based vaccines; random forest; voting ensemble
Year: 2021 PMID: 34829338 PMCID: PMC8617960 DOI: 10.3390/diagnostics11111990
Source DB: PubMed Journal: Diagnostics (Basel) ISSN: 2075-4418
Existing methods for T-cell epitope prediction.
| Sr. No | Method Name | Usage |
|---|---|---|
| 01 | NetMHC | Predicts HLA class I (CD8+ T-cell) epitopes |
| 07 | NetMHCII_2.3 | Predicts HLA class II (CD4+ T-cell) epitopes |
Figure 1. Proposed methodology.
Physicochemical properties used.
| Feature Category | Physicochemical Property | Category Count | Notations Used |
|---|---|---|---|
| F1 | Aliphatic Index | 1 | F1 |
| F2 | Boman Index | 1 | F2 |
| F3 | Instability Index | 1 | F3 |
| F4 | Probability of detection | 1 | F4 |
| F5 | Cross-covariance index | 1 | F5 |
| F6 | Hydrophobic Moment Index | 2 | F6_1, F6_2 |
| F7 | Molecular Weight | 2 | F7_1, F7_2 |
| F8 | Peptide Charge for 45 scales | 45 | F8_1 to F8_45 |
| F9 | Hydrophobicity at 44 scales | 44 | F9_1 to F9_44 |
| F10 | Isoelectric Point for 9 pK scale | 9 | F10_1 to F10_9 |
| F11 | Kidera Factors | 10 | F11_1 to F11_10 |
| F12 | aaComp | 18 | F12_1 to F12_18 |
| F13 | FASGAI vectors | 6 | F13_1 to F13_6 |
| F14 | blosumIndices | 10 | F14_1 to F14_10 |
| F15 | protFP descriptors | 8 | F15_1 to F15_8 |
| F16 | Cruciani properties | 3 | F16_1 to F16_3 |
Glimpse of the dataset.
| Peptide Sequence | F1 | F2 | ----- | F16_2 | F16_3 | Class |
|---|---|---|---|---|---|---|
| AFFGMSRIGMEVTPSGTW | 43.33 | 0.3938 | ----- | −0.302 | 0.082 | 1 |
| HLMGWDYPK | 43.33 | 0.9477 | ----- | −0.091 | −0.022 | 1 |
| TGTLIVNSVLLFLAF | 175.33 | 1.7473 | ----- | −0.284 | −0.03 | 0 |
| SVLLFLAFVVFLLVT | 214 | −3.036 | ----- | −0.156 | −0.15 | 0 |
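The F1 values in the table above appear consistent with the standard aliphatic index, which weights the mole percentages of alanine, valine, isoleucine, and leucine. The following is a minimal sketch of that formula, assuming the conventional weights 2.9 and 3.9; the function name `aliphatic_index` is illustrative, and the source does not state which implementation the authors used.

```python
def aliphatic_index(peptide: str) -> float:
    """Aliphatic index from the mole percentages of A, V, I and L."""
    seq = peptide.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("peptide must not be empty")
    # Mole percentage of each aliphatic residue in the sequence.
    pct = {aa: 100.0 * seq.count(aa) / n for aa in "AVIL"}
    # Conventional weights (assumed here): 2.9 for valine,
    # 3.9 for isoleucine and leucine.
    return pct["A"] + 2.9 * pct["V"] + 3.9 * (pct["I"] + pct["L"])


# Peptides copied from the dataset excerpt above.
for pep in ["AFFGMSRIGMEVTPSGTW", "HLMGWDYPK",
            "TGTLIVNSVLLFLAF", "SVLLFLAFVVFLLVT"]:
    print(pep, round(aliphatic_index(pep), 2))
```

For the four peptides shown, this gives 43.33, 43.33, 175.33, and 214.0, matching the F1 column.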
Important features selected by Boruta.
| Rank | Feature | Rank | Feature |
|---|---|---|---|
| 1 | F1 | 11 | F9_38 |
| 2 | F2 | 12 | F10_2 |
| 3 | F4 | 13 | F10_7 |
| 4 | F6_2 | 14 | F11_5 |
| 5 | F8_5 | 15 | F12_5 |
| 6 | F8_19 | 16 | F12_7 |
| 7 | F8_34 | 17 | F13_4 |
| 8 | F9_4 | 18 | F14_9 |
| 9 | F9_6 | 19 | F15_3 |
| 10 | F9_29 | 20 | F15_4 |
Figure 2. Proposed ensemble model.
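The keywords describe the model as a voting ensemble, but this record does not spell out the voting rule. The sketch below illustrates one common choice, hard (majority) voting over the class labels produced by several base classifiers. The function `majority_vote`, the tie-breaking rule, and the example predictions are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample."""
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same samples")
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Break ties deterministically by taking the smallest tied label.
        combined.append(min(lbl for lbl, c in counts.items() if c == top))
    return combined


# Hypothetical outputs of three base classifiers for four peptides
# (1 = epitope, 0 = non-epitope).
base_predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
print(majority_vote(base_predictions))  # [1, 1, 1, 0]
```

A soft-voting variant would average predicted class probabilities instead of counting labels; which variant the authors used cannot be determined from this excerpt.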
Classifiers used for comparison.
| Model Name | Tuned Parameters | Method Name | Package Name |
|---|---|---|---|
| Neural network (NN) | size:10 | nnet | nnet |
| Decision tree (DT) | maxsurrogate:0 | rpart | rpart |
| Support vector machine (SVM) | type:svc and kernel: rbfdot | ksvm | kernlab |
| Random forest (RF) | ntree:500 and mtry:2 | randomForest | randomForest |
| adaBoost (ada) | type: “discrete”, | ada | ada |
Figure 3. Confusion matrix.
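The threshold-based metrics reported in this work (accuracy, sensitivity, specificity, precision, and F-score) all derive from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `confusion_metrics` and the counts in the example are hypothetical and are not the values behind Figure 3.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # true-positive rate (recall)
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)     # positive predictive value
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }


# Hypothetical counts, with epitopes as the positive class.
for name, value in confusion_metrics(tp=90, fp=5, tn=95, fn=10).items():
    print(f"{name}: {value:.3f}")
```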
Figure 4. K-fold cross-validation process (K = 5).
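Repeated K-fold cross-validation can be sketched as follows: in each repetition the sample indices are reshuffled and partitioned into K folds, and each fold serves once as the held-out set. This is a minimal illustration under assumptions; the generator name `repeated_kfold`, the seed, and the absence of class stratification are choices made here, not details confirmed by the source.

```python
import random


def repeated_kfold(n_samples, k=5, repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated K-fold CV."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    rng = random.Random(seed)
    for r in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # fresh shuffle for every repetition
        # Fold i takes every k-th index, so fold sizes differ by at most one.
        folds = [indices[i::k] for i in range(k)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds)
                         if j != f for i in fold]
            yield r, f, train_idx, test_idx


# Example: 23 samples, K = 5, repeated twice.
for r, f, train, test in repeated_kfold(23, k=5, repeats=2):
    print(f"repeat {r + 1}, fold {f + 1}: "
          f"{len(train)} train / {len(test)} test")
```

With K = 5 and five repetitions, this yields 25 train/test splits, which corresponds to the 5 × 5 layout of the cross-validation results table.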
Results of proposed and existing prediction models on test dataset.
| Model | Accuracy (%) | AUC | Gini | Sensitivity | Specificity | F-Score | Precision |
|---|---|---|---|---|---|---|---|
| Neural Network | 95.66 | 0.981 | 0.980 | 0.959 | 0.971 | 0.910 | 0.929 |
| Decision Tree | 94.81 | 0.978 | 0.929 | 0.979 | 0.959 | 0.979 | 0.939 |
| SVM | 96.32 | 0.982 | 0.932 | 0.981 | 0.946 | 0.957 | 0.948 |
| RandomForest | 97.11 | 0.963 | 0.910 | 0.961 | 0.941 | 0.964 | 0.971 |
| adaBoost | 95.87 | 0.989 | 0.978 | 0.961 | 0.957 | 0.959 | 0.976 |
| Proposed ensemble | 98.20 | 0.991 | 0.994 | 0.982 | 0.971 | 0.990 | 0.981 |
Figure 5. Performance comparison bar chart of individual and proposed ensemble models.
Figure 6. ROC curve of the proposed ensemble model.
Repeated five-fold cross-validation results.
| Fold | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 | |
|---|---|---|---|---|---|---|
| 1 | 98.24 | 98.21 | 97.88 | 98.19 | 97.89 | |
| 2 | 97.91 | 97.87 | 97.65 | 98.01 | 97.89 | |
| 3 | 98.11 | 98.15 | 98.76 | 97.34 | 97.90 | |
| 4 | 98.03 | 98.32 | 97.65 | 98.21 | 98.02 | |
| 5 | 97.71 | 98.30 | 97.96 | 97.89 | 97.62 | |
| Mean Acc./iteration | | | | | | |
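The mean accuracy per iteration can be recomputed from the fold values listed in the table above. The sketch below does so with the standard library; the dictionary name `fold_accuracy` is illustrative, and the numbers are copied from the table columns.

```python
from statistics import mean

# Fold accuracies (%) for folds 1 to 5, one list per iteration,
# copied from the cross-validation table.
fold_accuracy = {
    "Iteration 1": [98.24, 97.91, 98.11, 98.03, 97.71],
    "Iteration 2": [98.21, 97.87, 98.15, 98.32, 98.30],
    "Iteration 3": [97.88, 97.65, 98.76, 97.65, 97.96],
    "Iteration 4": [98.19, 98.01, 97.34, 98.21, 97.89],
    "Iteration 5": [97.89, 97.89, 97.90, 98.02, 97.62],
}

iteration_means = {name: mean(vals) for name, vals in fold_accuracy.items()}
overall_mean = mean(a for vals in fold_accuracy.values() for a in vals)

for name, value in iteration_means.items():
    print(f"{name}: {value:.2f}")
print(f"Overall: {overall_mean:.2f}")
```

The overall mean over all 25 values comes out close to the 97.98% average accuracy quoted in the abstract.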
Figure 7. Accuracy plot of repeated K-fold cross validation (K = 5).
Validation results of the proposed ensemble model and its comparison with existing techniques.
| SARS-CoV-2 Peptide Sequences | Actual Class | Binding | Predictions by CTLpred | Predictions by the Proposed Model |
|---|---|---|---|---|
| APAICHD | 1 | 37 | 1 | 1 |
| TAPAICHD | 1 | 58 | 1 | 1 |
| QLNRALTGIAVEQDK | 1 | 6.2 | - | 1 |
| NFSQILPDPSKPSKR | 1 | 3.1 | - | 1 |
| DILSRLD | 1 | 65 | 1 | 1 |
| TGSNVFQTR | 1 | 45 | 1 | 1 |
| HSSGVTREL | 1 | 23 | 1 | 1 |
| YICGFIQQK | 1 | 4.2 | 1 | 1 |
| VVCTEIDPK | 1 | 8.2 | 1 | 1 |
| TIWFLLLSV | 1 | 76 | 1 | 1 |
| TIADYNYKL | 1 | 9.8 | 1 | 1 |
| SYYSLLMPI | 1 | 65 | 1 | 1 |
| SVKGLQPSV | 1 | 12 | 1 | 1 |
| SQDLSVVSKT | 1 | 19 | - | 1 |
| QLEMELTPV | 1 | 42 | 1 | 1 |
| QLEMELTPV | 1 | 7.3 | 1 | 1 |
| NYNYRYRLF | 1 | 1.9 | 1 | 1 |
| NIADYNYKL | 1 | 44 | 1 | 1 |
| LLIIMRTFK | 1 | 71 | 1 | 1 |
| KLDGFMGRI | 1 | 6.0 | 1 | 1 |
| HTITVEELK | 0 | 4.6 | 0 | 0 |
| SVKHVYQL | 0 | 52 | 0 | 0 |
| EYHLMSFPQSAPHGV | 0 | 79 | - | 0 |
| DIKNLSKSL | 0 | 80 | 0 | 0 |
| VWNLDY | 0 | 40 | 0 | 0 |
| VTLAILTAL | 0 | 32 | 0 | 0 |
| YLNTLTLAV | 0 | 41.2 | 0 | 0 |
| EPVLKGVKL | 0 | 5.6 | 0 | 0 |
| AAGLEAPFL | 0 | 9.3 | 0 | 0 |
| WTAGAAAYY | 0 | 4.4 | 0 | 0 |
| YLDGADVTK | 0 | 83 | 0 | 0 |
| SQLGGLHLL | 0 | 65 | 0 | 0 |
| LVKPSFYVY | 0 | 12 | 0 | 0 |
| LPYPDPSRI | 0 | 15.7 | 0 | 0 |
| AEWFLAYIL | 0 | 4.4 | 0 | 0 |
| VLLSVLQQL | 0 | 11 | 0 | 0 |
| SLPSYAAFATA | 0 | 89 | - | 0 |
| TLMNVLTLV | 0 | 37 | 0 | 0 |
| IPLTTAAKL | 0 | 61 | 0 | 0 |
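The validation table can be summarized by tallying, for each predictor, how many peptides received a prediction and how many of those match the actual class. The sketch below encodes the three label columns as strings in row order. It assumes that "-" means CTLpred returned no prediction for that peptide, which is an interpretation rather than something stated in this record; the helper `score` is illustrative.

```python
# Label columns of the validation table, one character per row, in row order.
actual = "1" * 20 + "0" * 19
ctlpred = "11--111111111-111111" + "00-0000000000000-00"
proposed = "1" * 20 + "0" * 19


def score(actual, predicted, missing="-"):
    """Count covered, missing and correct predictions against actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("label strings must have equal length")
    covered = [(a, p) for a, p in zip(actual, predicted) if p != missing]
    correct = sum(a == p for a, p in covered)
    return {
        "covered": len(covered),
        "missing": len(actual) - len(covered),
        "correct": correct,
        "accuracy_on_covered": correct / len(covered) if covered else None,
    }


print("CTLpred :", score(actual, ctlpred))
print("Proposed:", score(actual, proposed))
```

Under that reading, the proposed model labels all 39 listed peptides correctly, while CTLpred is correct on the 34 peptides for which it gives a prediction and gives none for the remaining 5.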