Syed Nisar Hussain Bukhari, Amit Jain, Ehtishamul Haq, Abolfazl Mehbodniya, Julian Webber.
Abstract
The human immune system (HIS) recognizes only the part of an antigen (a protein molecule found on the surface of a pathogen) that is composed of epitopes specific to T and B cells. Identification of epitopes is considered critical for designing an epitope-based peptide vaccine (EBPV). Although there are a number of vaccine types, EBPVs have received less attention thus far. EBPVs have a great deal of untapped potential for improving vaccine safety, and they are less expensive and quicker to produce. Thus, EBPVs are considered promising vaccine types for quickly containing global pandemics such as the ongoing outbreak of coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), as well as epidemics and endemics. The high mutation rate of SARS-CoV-2 has posed a great challenge to public health worldwide because either the composition of existing vaccines has to be changed or a new vaccine has to be developed to protect against its different variants. In such scenarios, where time is the critical factor, EBPVs can be a promising alternative. To design an effective and viable EBPV against different strains of a pathogen, it is important to identify the putative T- and B-cell epitopes. Identifying these epitopes with a wet-lab experimental approach is time-consuming and costly because a vast number of potential epitope candidates must be screened experimentally. Fortunately, various available machine learning (ML)-based prediction methods have reduced the burden of epitope mapping by shortening the list of potential epitope candidates for experimental trials. Moreover, these methods are cost-effective, scalable, and fast. This paper presents a systematic review of various state-of-the-art and relevant ML-based methods and tools for predicting T- and B-cell epitopes. Special emphasis is placed on highlighting and analyzing various models for predicting epitopes of SARS-CoV-2, the causative agent of COVID-19. Based on the various methods and tools discussed, future research directions for epitope prediction are presented.
Keywords: COVID-19; SARS-CoV-2; antibody; antigen; antigenic determinant; ensemble model; epitope-based peptide vaccine; epitopes; immune-relevant determinants; machine learning
Year: 2022 PMID: 35215090 PMCID: PMC8879824 DOI: 10.3390/pathogens11020146
Source DB: PubMed Journal: Pathogens ISSN: 2076-0817
Figure 1. Antigen recognition by antibodies.
Existing studies for T- and B-cell epitope prediction.
| Study Conducted | Methodology Adopted | Strengths/Limitations |
|---|---|---|
| T. Liu et al. [ | A feedforward deep neural network-based ensemble of 11 classifiers was created to predict BCEs. IEDB was used to obtain the BCE peptide dataset. On the test set, the model was evaluated using the AUROC metric. | Model reports peptide as an epitope if classified by all 11 classifiers. It would provide the best results if simple majority voting was used for classification. |
| Fatoba, A. J. et al. [ | In [ | The results of [ |
| R. Moody et al. [ | Authors used IEDB prediction tools for predicting B-cell epitopes and those with high scores in terms of prediction were selected as candidate epitopes. The epitopes were then matched to human proteins using NCBI Blast technology. | The findings showed eleven (11) novel B-cell epitopes in the host that were capable of explaining key elements of COVID-19 extrapulmonary disease that previous research had not been able to explain. |
| Jespersen MC et al. [ | The authors employed a feedforward neural network (FFNN) with two hidden layers of 25 neurons each, sigmoid activation at all neurons, and the Adam optimizer to predict antibody-specific B-cell epitopes, that is, the epitope targets of provided cognate antibodies. The dataset was obtained from the IEDB database. PCA was used for dimensionality reduction before the model was trained. | It was shown that a simple set of attributes retrieved from the cognate antibody boosted the rate of accuracy in predicting individual epitopes. Furthermore, sophisticated features such as Zernike Moments can improve the model’s predictive potential. When compared to DiscoTope 2.0, this model performs better in finding patches overlapping with an actual patch of an epitope in cross-validation and on an independent dataset. |
| Ling-yun Liu et al. [ | The authors used PCA and RNN networks. They converted the physicochemical properties into digital vectors, intending to have high-dimensional feature space, and later PCA was applied to process them. The output from PCA was used as an input to the RNN for predicting epitopes. | Prediction results obtained by this process demonstrated that PCA reduced dimensions, but at the same time, original features of the main component were retained, and the rate of prediction was also improved. |
| Bin Cheng et al. [ | Authors introduced a novel scale to measure feature importance, called the relevance of amino acid pair (RAAP). RAAP was calculated by decomposing the sequences of amino acids based on their physicochemical properties. | The successful prediction rate was drastically improved here by using LSTM. It does not suffer from gradient instability and is good enough for textual classification sequences. Fivefold cross-validation was used to test and validate the models. |
| Balachandran Manavalan et al. [ | Here, a non-redundant dataset was constructed containing 5500 BCEs experimentally validated, and 6893 non-B-cell epitopes were retrieved from IEDB. Then, an ensemble model to predict B-cell epitopes based on ERT (extremely randomized tree) and a classifier called GB (gradient boosting) was developed. The model works based on the physicochemical properties, AA composition, and combination of dipeptides and PCP as the input features. | After performing cross-validation on a benchmark dataset, it was shown that this model performed far better than the individual classifiers such as ERT and GB, with an MCC (Matthews correlation coefficient) of 0.454. |
| Yuh-Jyh Hu et al. [ | A cost-sensitive strategy based on bagging MDT was suggested, which integrates two ensemble-based learning algorithms. Without employing the prediction of a pre-trained single predictor, it makes it independent of multiple prediction tools. It can also learn a meta-classification architecture with varied data, without being constrained by a particular hierarchy. | It was demonstrated that the performance of prediction is superior as compared to a single epitope predictor. However, epitope prediction based on meta-learning is purely dependent upon the predictive strength of various other pre-trained linear and conformational epitope prediction tools, which cannot be retained directly by users. Hence, this limits the flexibility and applicability of these meta-classifiers. |
| Jing Ren et al. [ | The authors proposed a novel staged heterogeneity-based learning model. The model learns both the heterogeneity and the characteristics of the data in a phased manner to identify antigen residues belonging to heterogeneous conformational B-cell epitopes, based purely on antigen sequences. In the first stage, the model learns the generic epitope pattern using propensities; in the second stage, the same model learns the heterogeneous complementarity of the propensities used in the first stage, this time on a small dataset of experimentally verified epitopes. | It was demonstrated that if heterogeneity was learned well, the transferability of the model improved remarkably in handling new data. It was tested and validated on two different datasets: one with experimentally determined epitopes and another with computationally defined epitopes. It showed outstanding performance that was around twice that of existing predictors, including CBTOPE. |
| Georgios A. et al. [ | A novel method, “SEPIa”, was proposed to predict B-cell epitopes from protein sequences; it is fast enough to be applied to large-scale datasets. The model combines two classifiers, random forest and naïve Bayes. | The average prediction accuracy of SEPIa is limited: the AUC score is 0.65 in both 10-fold cross-validation and on the independent test dataset, although this is higher than other approaches tested on the same test dataset. |
| Gene Sher et al. [ | Authors proposed a novel, analytically trained DRREP (Deep Ridge Regressed Epitope Predictor), a deep neural network based on string kernels and tailored to predict continuous epitopes. | The model was tested on long protein sequences from datasets such as AntiJen, Pellequer, and HIV. The results were compared with epitope predictors such as DMNLBE, LBtope, etc. Measured by the area under the curve (AUC), the model improved performance on the SARS dataset by 13.7%, on HIV by 8.9%, and on Pellequer by 1.5%. |
| Wen Zhang et al. [ | Authors attempted to differentiate immunogenic epitopes from non-immunogenic epitopes based purely on their primary structure. To effectively utilize various features, an ensemble method based on a genetic algorithm was proposed. | The model was tested on two benchmark datasets: IMMA2, PAAQD. The model was compared with methods such as POPI, PAAQD, and POPISK, which are considered state-of-the-art in nature. The model performed better, with an AUC score on IMMA2 of 0.846 and 0.829 on PAAQD. |
| Wei Zheng et al. [ | The authors used ensemble learning to improve the prediction of BCEs. Their ensemble method combined twelve SVMs. To handle imbalanced datasets, resampling and AdaBoost methods were used. | The proposed ensemble model achieved an AUC score of 0.642–0.672 on the training dataset with five-fold cross-validation and an AUC score of 0.579–0.604 on the test dataset. |
| Jian Zhang et al. [ | To predict antigenic determinants, the authors devised a cost-sensitive ensemble approach, and a spatial clustering-based algorithm was used to identify probable epitopes. | The model performed admirably in terms of prediction. AUC scores of 0.721 and 0.703 were obtained using leave-one-out cross-validation (LOOCV) on two benchmark datasets: bound and unbound. |
| Kavitha K V et al. [ | PCA was used to reduce dimensions and to filter out the essential features; for prediction purposes, a random forest algorithm was used. | Experimental results showed that the random forest-based classifier had an improved prediction accuracy rate as compared to BCPred, AAP, etc. |
| Wen Zhang et al. [ | The authors used sequence-derived features and developed an ensemble model based on random forest to predict epitopes accurately. | The model was evaluated using the leave-one-out cross-validation procedure, and an AUC score of 0.687 and 0.651 on bound and unbound datasets was obtained. |
| Ping Chen et al. [ | Authors reviewed various prediction models for epitopes, such as models based on SVM, neural network, random forest, etc., to defend computational approaches to epitope prediction, since experimental (wet-lab) methods require a lot of effort and time. | Apart from defending the computational approaches, it was also concluded that current models are limited: it is impossible to devise an exact model without complete knowledge of the immune system, so current models are at best approximations. |
| Claus Lundegaard et al. [ | Here, an artificial neural network was used. The standard feedforward neural network with backpropagation was employed to predict epitopes. The dataset was retrieved from the SYFPEITHI database. | The model efficiently and accurately predicts MHC class I type peptides and outperforms the existing methods. |
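The table above notes that the ensemble of T. Liu et al. reports a peptide as an epitope only when all 11 classifiers agree, and suggests simple majority voting as an alternative; several of the listed studies also report the Matthews correlation coefficient (MCC). The following is a minimal sketch of the two voting rules and of MCC in plain Python. The votes and ground-truth labels are made up for illustration and are not taken from any of the cited studies; the function names are hypothetical.

```python
from math import sqrt


def unanimous_vote(votes):
    """Call a peptide an epitope only if every classifier votes 1."""
    return all(votes)


def majority_vote(votes):
    """Call a peptide an epitope if more than half of the classifiers vote 1."""
    return sum(votes) > len(votes) / 2


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0.0 if undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical votes from 11 classifiers for four peptides (1 = epitope).
votes_per_peptide = [
    [1] * 11,            # all classifiers agree: epitope
    [1] * 9 + [0] * 2,   # strong majority for epitope
    [1] * 4 + [0] * 7,   # minority for epitope
    [0] * 11,            # all classifiers agree: non-epitope
]
y_true = [1, 1, 0, 0]    # made-up ground truth, for illustration only

y_unanimous = [int(unanimous_vote(v)) for v in votes_per_peptide]
y_majority = [int(majority_vote(v)) for v in votes_per_peptide]

print("unanimous:", y_unanimous, "MCC =", round(mcc(y_true, y_unanimous), 3))
print("majority: ", y_majority, "MCC =", round(mcc(y_true, y_majority), 3))
```

In this toy data the unanimous rule misses the peptide that 9 of 11 classifiers accept, which is the kind of lost sensitivity the table's remark about majority voting points to. Real data could behave differently.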
Prediction tools for T-cell epitopes categorized based on the methods they use (CITATION).
| Tool Name | Web URL | MHC Class Supported (I, II, or Both) | S | A | P | T |
|---|---|---|---|---|---|---|
| | | | | | | |
| EpiDOCK [ | | II | - | - | - | - |
| | | | | | | |
| Vaxign [ | | Both | - | - | - | - |
| PEPVAC [ | | I | X | - | X | - |
| EPISOPT [ | | I | X | - | - | - |
| MAPPP [ | | I | X | - | X | - |
| PREDIVAC [ | | II | - | - | - | - |
| SYFPEITHI [ | | Both | - | - | - | - |
| Rankpep [ | | Both | - | - | X | - |
| | | | | | | |
| MotifScan [ | | Both | X | - | - | - |
| | | | | | | |
| EpiJen [ | | I | - | X | X | X |
| Propred [ | | II | X | X | - | - |
| TEPITOPE [ | | II | - | X | - | - |
| Propred 1 [ | | I | X | X | X | - |
| BIMAS [ | | I | - | X | - | - |
| | | | | | | |
| EpiTOP [ | | II | - | X | - | - |
| MHCPred [ | | Both | - | X | - | - |
| | | | | | | |
| NetCTL [ | | I | X | X | X | X |
| MULTIPRED2 [ | | Both | X | - | - | - |
| NetMHC [ | | I | - | X | - | - |
| NetMHCpan [ | | I | - | X | - | - |
| NetMHCII [ | | II | - | X | - | - |
| NetMHCIIpan [ | | II | - | X | - | - |
| NHLApred [ | | I | - | - | X | - |
| | | | | | | |
| IL4pred [ | | II | - | - | - | - |
| WAPP [ | | I | - | - | X | X |
| SVRMHC [ | | Both | - | X | - | - |
| SVMHC [ | | Both | - | - | - | - |
| MHC2PRED [ | | II | - | - | - | - |
| | | | | | | |
| IEDB-MHCI [ | | I | - | X | - | - |
| IEDB-MHCII [ | | II | - | X | - | - |
S: Prediction of supertypes, A: Quantitative binding affinity, P: Proteasomal cleavage, T: TAP binding.
Figure 2. Linear and conformational B-cell epitopes.
Prediction tools for B-cell epitopes.
| Tool Name | Web URL | Methodology Used |
|---|---|---|
| Prediction of Linear B-Cell Epitopes | | |
| BepiPred [ | | Decision tree |
| PEOPLE [ | | Propensity scale |
| LBtope [ | | ANN |
| SVMTriP [ | | SVM |
| BCPREDS [ | | SVM |
| ABCpred [ | | ANN |
| Prediction of Conformational B-Cell Epitopes | | |
| DiscoTope [ | | Structure-based (SM) |
| PEPITO [ | | SM |
| ElliPro [ | | SM |
| CEP [ | | SM |
| EPITOPIA [ | | SM (Naïve Bayes) |
| EPIPRED [ | | SM (Docking, ASEP) |
| EPSVR [ | | SM |
| PEPITOPE [ | | Mimotope |
| CBTOPE [ | | SM (SVM) |
| EpiSearch [ | | Mimotope |
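The "propensity scale" methodology listed in the table can be illustrated with a generic sliding-window score over a protein sequence. The sketch below is not the algorithm of PEOPLE or of any other listed tool. It assumes the Kyte-Doolittle hydropathy scale, negated to act as a hydrophilicity propensity, and the window size of 7 and threshold of 1.0 are hypothetical choices made only for illustration.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_scores(seq, window=7):
    """Mean hydrophilicity (negated hydropathy) of each full window along seq."""
    hydrophilicity = [-KD[aa] for aa in seq]
    return [
        sum(hydrophilicity[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def candidate_windows(seq, window=7, threshold=1.0):
    """Return (start, peptide, score) for windows scoring above the threshold."""
    return [
        (i, seq[i:i + window], score)
        for i, score in enumerate(window_scores(seq, window))
        if score > threshold
    ]


# Toy sequence: a hydrophilic lysine stretch followed by a hydrophobic alanine stretch.
toy_sequence = "KKKKKKKAAAAAAA"
for start, peptide, score in candidate_windows(toy_sequence):
    print(start, peptide, round(score, 2))
```

Windows dominated by hydrophilic residues score high and are kept as candidate linear epitope regions. Practical predictors typically combine several scales or replace the fixed threshold with a trained classifier, as the ANN- and SVM-based tools in the table do.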
Existing ML methods used in SARS-CoV-2 epitope prediction.
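| Sr. No. | Method Name | Usage |
|---|---|---|
| 01 | NetMHC [ | To predict HLA class I (CD8+) SARS-CoV-2 T-cell epitopes |
| 02 | NetMHCpan [ | To predict HLA class I (CD8+) SARS-CoV-2 T-cell epitopes |
| 03 | NetCTLpan_1.1 [ | To predict HLA class I (CD8+) SARS-CoV-2 T-cell epitopes |
| 04 | NetMHC_4.0 [ | To predict HLA class I (CD8+) SARS-CoV-2 T-cell epitopes |
| 05 | HLAthena [ | To predict HLA class I (CD8+) SARS-CoV-2 T-cell epitopes |
| 06 | MHCflurry [ | To predict HLA class I (CD8+) SARS-CoV-2 T-cell epitopes |
| 07 | NetMHCII_2.3 [ | To predict HLA class II (CD4+) SARS-CoV-2 T-cell epitopes |
| 08 | NetMHCIIpan_3.0 [ | To predict HLA class II (CD4+) SARS-CoV-2 T-cell epitopes |
| 09 | NetMHCIIpan_4.0 [ | To predict HLA class II (CD4+) SARS-CoV-2 T-cell epitopes |
| 10 | NeonMHC2 [ | To predict HLA class II (CD4+) SARS-CoV-2 T-cell epitopes |
| 11 | MARIA [ | To predict HLA class II (CD4+) SARS-CoV-2 T-cell epitopes |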