Syed Nisar Hussain Bukhari, Amit Jain, Ehtishamul Haq, Moaiad Ahmad Khder, Rahul Neware, Jyoti Bhola, Moslem Lari Najafi.
Abstract
Zika virus (ZIKV), the causative agent of Zika fever in humans, is an RNA virus that belongs to the genus Flavivirus. Currently, no vaccine is approved for clinical use to combat ZIKV infection and contain the epidemic. Although many attempts have been made to develop vaccines for ZIKV, none has proved successful. Epitope-based peptide vaccines have large untapped potential for improving vaccination safety, cross-reactivity, and immunogenicity, and they can act as powerful alternatives to conventional vaccines because of their low production cost and their lower reactogenicity and allergenicity. To design an effective and viable epitope-based peptide vaccine against this deadly virus, it is essential to select antigenic T-cell epitopes, since epitope-based vaccines are considered safe. An in silico machine-learning-based approach to ZIKV T-cell epitope prediction would save considerable experimental time and effort compared with in vivo approaches and so speed up vaccine development. We trained a machine-learning-based computational model to predict novel ZIKV T-cell epitopes from the physicochemical properties of amino acids. The proposed ensemble model uses a voting mechanism that combines the class predictions (epitope or nonepitope) of the base classifiers: the votes for each class are summed, and the class with the majority of votes is predicted. An odd number of classifiers is used to avoid ties in the voting. A data set of experimentally determined ZIKV peptide sequences was collected from the Immune Epitope Database and Analysis Resource (IEDB) repository. The data set consists of 3,519 sequences, of which 1,762 are epitopes and 1,757 are nonepitopes. Sequence lengths range from 6-mer to 30-mer. For each sequence, we extracted 13 physicochemical features.
The proposed ensemble model achieved sensitivity, specificity, Gini coefficient, AUC, precision, F-score, and accuracy of 0.976, 0.959, 0.993, 0.994, 0.989, 0.985, and 97.13%, respectively. To check the consistency of the model, we carried out five-fold cross-validation, which gave an average accuracy of 96.072%. Finally, a comparison of the proposed model with existing methods on a separate validation data set suggests that the proposed ensemble model performs better. The proposed ensemble model can help predict novel ZIKV vaccine candidates, with the aim of saving lives globally and preventing future epidemic-scale outbreaks.
Year: 2021 PMID: 34631001 PMCID: PMC8500748 DOI: 10.1155/2021/9591670
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 2.682
Snapshot of the data set.
| Peptide sequence | Sequence length (SL) | F1 | F2 | F10 | F11 | Class (1 = epitope, 0 = nonepitope) |
|---|---|---|---|---|---|---|
| GSLQLLAIE | 9 | 184.4444 | −0.74222 | −0.56222 | 3 | 0 |
| EEQRYTCHVQHEGLPKPLTLRW | 22 | 66.36364 | 2.643182 | −0.08955 | 4 | 0 |
| LQSNGWDRLKRMAVS | 15 | 78 | 2.785333 | −0.07 | 4 | 1 |
| YKYKVVKIEPLGVA | 14 | 125 | −0.06929 | −0.12429 | 2 | 0 |
| GDTLKECPLKHRAWNSFL | 18 | 70.55556 | 1.928889 | −0.02667 | 5 | 1 |
| HMCDATMSY | 9 | 11.11111 | 1.3 | −0.18 | 4 | 1 |
| KAFEATVRGAKRMAV | 15 | 65.33333 | 1.915333 | −0.62 | 6 | 1 |
| CKRGIKSGS | 9 | 43.33333 | 2.748889 | 0.385556 | 5 | 0 |
| WASRELERF | 9 | 54.44444 | 3.868889 | −0.46222 | 2 | 0 |
| AVRHFPRIW | 9 | 86.66667 | 2.046667 | −0.09444 | 1 | 0 |
Physicochemical properties used in the current study.
| Sr. no. | Property name | Package | Function name | Notation |
|---|---|---|---|---|
| 1 | Aliphatic index | Peptides | aIndex(seq) | F1 |
| 2 | Potential protein interaction index | Peptides | boman(seq) | F2 |
| 3 | Instability index of a protein sequence | Peptides | instaIndex(seq) | F3 |
| 4 | Probability of detection of a peptide | peptider | ppeptide(x, libscheme, N) | F4 |
| 5 | Hydrophobic moment | Peptides | hmoment(seq, angle) | F5_1, F5_2 |
| 6 | Molecular weight | Peptides | mw(seq, monoisotopic) | F6_1, F6_2 |
| 7 | Theoretical net charge at 9 pKa scales | Peptides | charge | F7 |
| 8 | Hydrophobicity index | Peptides | hydrophobicity | F8 |
| 9 | Isoelectric point | Peptides | pI | F9 |
| 10 | Kidera factors | Peptides | kideraFactors | F10 |
| 11 | Amino acid composition | Peptides | aaComp | F11 |
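As an illustration of how one of these features is derived, the following minimal Python sketch computes the aliphatic index (F1) with the standard formula, in which the mole percentages of alanine, valine, isoleucine, and leucine are weighted by 1, 2.9, 3.9, and 3.9. The study itself used the R function listed in the table; the function name `aliphatic_index` is an assumption made here for illustration. The formula reproduces the F1 values shown in the data set snapshot.

```python
def aliphatic_index(seq: str) -> float:
    """Aliphatic index of a peptide: weighted mole percent of A, V, I and L.

    Illustrative re-implementation of the standard formula; not the
    authors' code, which used an R package.
    """
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")

    def mole_percent(residue: str) -> float:
        return 100.0 * seq.count(residue) / n

    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )


# Peptides taken from the data set snapshot.
for peptide in ("GSLQLLAIE", "LQSNGWDRLKRMAVS", "HMCDATMSY"):
    print(peptide, round(aliphatic_index(peptide), 4))
```

The remaining features would need their own amino acid scales and are not sketched here.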
Feature importance score.
| Feature | Score |
|---|---|
| F4 | 60.53 |
| F6_2 | 52.03 |
| F6_1 | 51.95 |
| F8 | 46.15 |
| F2 | 44.18 |
| F10 | 43.43 |
| F9 | 42.25 |
| F5_1 | 41.69 |
| F1 | 40.87 |
| F3 | 39.49 |
| F5_2 | 38.08 |
| F7 | 36.36 |
| F11 | 30.52 |
Figure 1. Feature importance line plot.
Figure 2. Workflow for classification of peptide sequences.
Figure 3. Methodology used.
Figure 4. The proposed ensemble model for ZIKV T-cell epitope prediction.
Machine-learning classifiers used in the current study.
| Sr. no. | Classifier | Function name | Tuned parameters |
|---|---|---|---|
| 01 | Decision trees | rpart | maxsurrogate = 0, usesurrogate = 0 |
| 02 | Neural network | nnet | size = 10, maxit = 100 |
| 03 | Support vector machine | ksvm | kernel = "rbfdot", type = "C-svc" |
| 04 | AdaBoost | ada | iter = 50, type = "discrete", nu = 0.5 |
| 05 | Random forest | randomForest | ntree = 500, mtry = 2 |
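The voting step that combines these five base classifiers can be sketched as follows. This is a minimal Python illustration of hard majority voting under the assumption that each classifier outputs a binary label (1 = epitope, 0 = nonepitope); the function names and the example labels are hypothetical and are not taken from the study.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Return the label predicted by most classifiers for one peptide."""
    if len(votes) % 2 == 0:
        # With two classes, an odd number of voters rules out ties.
        raise ValueError("use an odd number of classifiers to avoid ties")
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_classifier: Sequence[Sequence[int]]) -> List[int]:
    """per_classifier[i][j] is classifier i's label for peptide j."""
    return [majority_vote(column) for column in zip(*per_classifier)]


# Hypothetical labels from five base classifiers for three peptides.
base_predictions = [
    [1, 0, 1],  # decision tree
    [1, 0, 0],  # neural network
    [0, 0, 1],  # support vector machine
    [1, 1, 1],  # AdaBoost
    [1, 0, 0],  # random forest
]
print(ensemble_predict(base_predictions))
```

Because there are two classes and five voters, every peptide receives a strict majority, which is the reason the abstract gives for choosing an odd number of classifiers.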
Performance comparison of existing models with the proposed ensemble.
| Model | Gini | Precision | F-score | AUC | Sensitivity | Specificity | Accuracy (%) |
|---|---|---|---|---|---|---|---|
| Random forest | 0.905 | 0.963 | 0.958 | 0.952 | 0.953 | 0.921 | 94.29 |
| Neural network | 0.990 | 0.936 | 0.951 | 0.973 | 0.948 | 0.963 | 96.52 |
| AdaBoost | 0.988 | 0.985 | 0.963 | 0.994 | 0.942 | 0.972 | 95.24 |
| Decision tree | 0.987 | 0.972 | 0.972 | 0.993 | 0.972 | 0.938 | 96.19 |
| SVM | 0.912 | 0.979 | 0.975 | 0.995 | 0.972 | 0.956 | 96.67 |
| Proposed ensemble model | 0.993 | 0.989 | 0.985 | 0.994 | 0.976 | 0.959 | 97.13 |
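The source does not spell out how these metrics were computed. The sketch below uses the standard confusion-matrix definitions of sensitivity, specificity, precision, F-score, and accuracy, treating epitope as the positive class; the counts in the example are hypothetical and are not the study's confusion matrix.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true-positive rate (recall)
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)     # positive predictive value
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = 100.0 * (tp + tn) / (tp + fp + tn + fn)  # in percent
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
        "accuracy_percent": accuracy,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```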
Figure 5. Comparison chart of existing models with the proposed model.
Figure 6. ROC plot of the proposed ensemble model.
Five-fold cross-validation.
| Fold | Accuracy (%) |
|---|---|
| 1 | 96.27 |
| 2 | 95.28 |
| 3 | 97.52 |
| 4 | 96.49 |
| 5 | 94.80 |
Figure 7. Five-fold cross-validation results of the proposed ensemble model.
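The reported average accuracy follows directly from the five fold accuracies. The Python sketch below also shows one simple way to partition 3,519 samples into five folds; the contiguous, unshuffled split and the function name `k_fold_indices` are assumptions for illustration, since the source does not describe how the folds were drawn.

```python
from statistics import mean
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Fold sizes for the 3,519 sequences in the data set.
fold_sizes = [len(test) for _, test in k_fold_indices(3519, k=5)]
print(fold_sizes)

# Per-fold accuracies (%) from the five-fold cross-validation table.
fold_accuracy = [96.27, 95.28, 97.52, 96.49, 94.80]
average_accuracy = mean(fold_accuracy)
print(round(average_accuracy, 3))
```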
Validation results of the proposed ensemble model and its comparison with existing methods.
| Peptide sequence | Actual target | Binding capacity by NetMHC | Predictions by the proposed model (NetMHC comparison) | Predictions by CTLpred | Predictions by the proposed model (CTLpred comparison) |
|---|---|---|---|---|---|
| NSFVVDGDT | Epitope | 49 | 1 | Epitope | 1 |
| VREDYSLECDPAVIG | Epitope | 25 | 1 | — | 1 |
| AQMAVDMQT | Epitope | 3.9 | 1 | Epitope | 1 |
| FVVDGDTLKECPLKH | Epitope | 2.2 | 1 | — | 1 |
| GEAYLDKQ | Epitope | 75 | 1 | Nonepitope | 1 |
| GPSLRSTTASGRVIE | Epitope | 34 | 1 | — | 1 |
| MEIRPRKEPESNLVR | Epitope | 65 | 1 | — | 1 |
| TRGPSLRST | Epitope | 7.2 | 1 | Epitope | 1 |
| MLRIINARG | Nonepitope | 3.4 | 0 | Nonepitope | 0 |
| IQIMDLGHMATC | Nonepitope | 56 | 0 | — | 0 |
| LVTCAKMQ | Nonepitope | 80 | 0 | Nonepitope | 0 |
| LGGFGSL | Nonepitope | 78 | 0 | Epitope | 0 |
| VVVLGSQERIN | Nonepitope | 34 | 0 | — | 0 |
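The agreement between each method and the actual targets in the validation table can be tallied with a short script. In this Python sketch the labels are transcribed from the table (1 = epitope, 0 = nonepitope), and a dash in the CTLpred column is assumed to mean that no prediction was available, so those rows are left out of the CTLpred count. The NetMHC binding values are not scored because the source gives no threshold for them.

```python
from typing import Iterable, Optional, Tuple

# (peptide, actual, proposed_model, ctlpred); None = no CTLpred prediction (assumed).
ROWS = [
    ("NSFVVDGDT", 1, 1, 1),
    ("VREDYSLECDPAVIG", 1, 1, None),
    ("AQMAVDMQT", 1, 1, 1),
    ("FVVDGDTLKECPLKH", 1, 1, None),
    ("GEAYLDKQ", 1, 1, 0),
    ("GPSLRSTTASGRVIE", 1, 1, None),
    ("MEIRPRKEPESNLVR", 1, 1, None),
    ("TRGPSLRST", 1, 1, 1),
    ("MLRIINARG", 0, 0, 0),
    ("IQIMDLGHMATC", 0, 0, None),
    ("LVTCAKMQ", 0, 0, 0),
    ("LGGFGSL", 0, 0, 1),
    ("VVVLGSQERIN", 0, 0, None),
]


def agreement(pairs: Iterable[Tuple[int, Optional[int]]]) -> Tuple[int, int]:
    """Return (matches, scored) for (actual, predicted) pairs, skipping None."""
    scored = [(actual, pred) for actual, pred in pairs if pred is not None]
    return sum(actual == pred for actual, pred in scored), len(scored)


proposed = agreement((actual, model) for _, actual, model, _ in ROWS)
ctlpred = agreement((actual, ctl) for _, actual, _, ctl in ROWS)
print("proposed model:", proposed)
print("CTLpred:", ctlpred)
```

On these 13 peptides the proposed model matches every actual target, while CTLpred matches 5 of the 7 peptides for which it has a prediction.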