Jianzhao Gao, Eshel Faraggi, Yaoqi Zhou, Jishou Ruan, Lukasz Kurgan.
Abstract
Accurate identification of immunogenic regions in a given antigen chain is a difficult and actively pursued problem. Although accurate predictors for T-cell epitopes are already in place, the prediction of B-cell epitopes requires further research. We overview the available approaches for the prediction of B-cell epitopes and propose a novel and accurate sequence-based solution. Our BEST (B-cell Epitope prediction using Support vector machine Tool) method predicts epitopes from full antigen sequences, in contrast to some methods that predict only from short sequence fragments. It uses a new architecture that averages selected scores generated from sliding 20-mers by a Support Vector Machine (SVM). The SVM predictor utilizes a comprehensive and custom-designed set of inputs generated by combining information derived from the chain, sequence conservation, similarity to known (training) epitopes, and predicted secondary structure and relative solvent accessibility. Empirical evaluation on benchmark datasets demonstrates that BEST outperforms several modern sequence-based B-cell epitope predictors, including ABCPred, the method by Chen et al. (2007), BCPred, COBEpro, BayesB, and CBTOPE, both for predictions from antigen chains and for predictions from chain fragments. Our method obtains a cross-validated area under the receiver operating characteristic curve (AUC) for the fragment-based prediction of 0.81 and 0.85, depending on the dataset. The AUCs of BEST on the two benchmark sets of full antigen chains equal 0.57 and 0.6, which are, respectively, significantly and slightly better than those of the next best method we tested. We also present case studies to contrast the propensity profiles generated by BEST and several other methods.
Year: 2012 PMID: 22761950 PMCID: PMC3384636 DOI: 10.1371/journal.pone.0040104
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of the considered features and features selected and used in the proposed sequence-based predictor of B-cell epitopes.
| Feature group | Abbreviated name | Number of features | Number of selected features |
| Predicted secondary structure (SS) | SS | 8 | 2 |
| Predicted RSA | RA | 33 | 5 |
| RAAP score | RP | 30 | 24 |
| Conservation score | CS | 29 | 2 |
| Predicted SS and RSA | SS+RA | 12 | 6 |
| Predicted SS and conservation score | SS+CS | 6 | 1 |
| Predicted SS and RAAP score | SS+RP | 6 | 1 |
| RAAP score and predicted RSA | RP+RA | 30 | 17 |
| RAAP and conservation scores | RP+CS | 28 | 18 |
| Predicted SS and RSA, and RAAP score | SS+RA+RP | 6 | 1 |
| Similarity score | SIM | 10 | 7 |
| Total number of features | | 198 | 84 |
Comparison of predictive quality on the ChenFrag dataset calculated using either 10-fold cross validation or 5-fold cross validation to match the test type from the corresponding manuscripts. The methods are sorted by their AUC values in the ascending order.
| Method | AUC | Accuracy | Sensitivity | Specificity | Precision | F-measure | MCC |
| Chen et al. | unavailable | 0.725 | 0.636 | 0.765 | 0.701 | 0.667 | 0.40 |
| SVM model 198 | 0.835 | 0.783 | 0.587 | 0.979 | 0.966 | 0.730 | 0.62 |
| COBEpro | 0.829 | 0.780 | 0.609 | 0.951 | 0.925 | 0.734 | 0.59 |
| SVM model 198 | 0.840 | 0.792 | 0.597 | 0.987 | 0.979 | 0.742 | 0.63 |
| SVM model 84 | 0.848 | 0.788 | 0.579 | 0.998 | 0.996 | 0.732 | 0.63 |
Chen et al.: results based on 5-fold cross validation from Table 3 in [20].
SVM model 198 (AUC 0.835): results based on 5-fold cross validation for the SVM model (C = 8.0 and gamma = 0.000977) that uses all 198 features.
COBEpro: results based on 10-fold cross validation from Table I in [22].
SVM model 198 (AUC 0.840): results based on 10-fold cross validation for the SVM model (C = 8.0 and gamma = 0.000977) that uses all 198 features.
SVM model 84: results based on 10-fold cross validation for the SVM model (C = 1.0 and gamma = 0.001953) that uses the selected 84 features.
Figure 1. Overall design of the proposed BEST method.
Comparison of predictive quality on the BCPREDFrag dataset calculated using 10-fold cross validation. The methods are sorted by their AUC values in the ascending order.
| Method | AUC | Accuracy | Sensitivity | Specificity | Precision | F-measure | MCC |
| Chen et al. | 0.700 | 0.641 | 0.529 | 0.752 | 0.681 | 0.596 | 0.29 |
| BCPred | 0.758 | 0.679 | 0.726 | 0.632 | 0.664 | 0.694 | 0.36 |
| COBEpro | 0.768 | 0.714 | 0.554 | 0.874 | 0.815 | 0.660 | 0.45 |
| SVM model 198 | 0.811 | 0.745 | 0.561 | 0.929 | 0.887 | 0.687 | 0.53 |
| SVM model 84 | 0.813 | 0.740 | 0.495 | 0.984 | 0.969 | 0.655 | 0.55 |
results from Table 1 in [21].
results from Table II in [22].
results for the SVM model (C = 8.0 and gamma = 0.000977) that uses all 198 features.
results for the SVM model (C = 1.0 and gamma = 0.001953) that uses the selected 84 features.
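The columns of the two fragment-level tables can all be derived from the four confusion-matrix counts. As a rough consistency check, the sketch below recomputes the SVM model 84 row for BCPREDFrag. It assumes the standard definitions of each metric and the 701 epitopic and 701 non-epitopic 20-mers reported for this dataset; the counts are back-calculated from the published sensitivity and specificity, so they are approximations, and the function name `binary_metrics` is illustrative only.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # fraction of epitopes recovered
    specificity = tn / (tn + fp)            # fraction of non-epitopes rejected
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "mcc": mcc,
    }

# Assumption: counts reconstructed from the reported sensitivity (0.495)
# and specificity (0.984) on 701 positive and 701 negative fragments.
n_pos = n_neg = 701
tp = round(0.495 * n_pos)
tn = round(0.984 * n_neg)
metrics = binary_metrics(tp, n_pos - tp, tn, n_neg - tn)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the recomputed accuracy, precision, F-measure and MCC agree with the tabulated values to within rounding, which suggests the table uses the conventional definitions.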
Figure 2. Receiver operating characteristic (ROC) curves for the SVM model with 84 features and for the RAAP and MaxSimilarity models.
The curves were computed based on the 10-fold cross validation on the BCPREDFrag dataset (panel A) and ChenFrag dataset (panel B).
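For readers who want to reproduce an AUC from raw prediction scores, the following is a minimal sketch that uses the rank-based (Mann-Whitney) interpretation of the area under the ROC curve. It is not the authors' evaluation code; the function name `roc_auc` and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data (1 = epitope fragment, 0 = non-epitope); not from the paper.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to a random ranking and 1.0 to a perfect one, which is the scale on which the values in the tables should be read.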
AUC values on the BCPREDFrag and ChenFrag datasets calculated using 10-fold cross validation obtained by using selected features from individual feature groups; abbreviated names of feature groups are given in Table 1.
| Dataset | SS | RA | RP | CS | SS+RA | SS+CS | SS+RP | RP+RA | RP+CS | SS+RA+RP | SIM |
| BCPREDFrag | 0.557 | 0.542 | 0.716 | 0.501 | 0.602 | 0.568 | 0.532 | 0.695 | 0.710 | 0.556 | 0.760 |
| ChenFrag | 0.565 | 0.547 | 0.743 | 0.496 | 0.584 | 0.545 | 0.555 | 0.738 | 0.743 | 0.560 | 0.824 |
Figure 3. The values of the similarity-based scores between the 20-mers from the BCPREDFrag dataset and the library of the epitope fragments, i.e., the max_similarity_epitope feature.
The black line shows the similarity scores for the native epitope and the gray line for the non-epitope fragments. The x-axis corresponds to the sorted list (in the ascending order based on the similarity scores) of the 701 epitopic and 701 non-epitopic 20-mers from the BCPREDFrag dataset, and the y-axis shows their corresponding similarity scores.
Figure 4. The AUC and success rate values as a function of the number of selected scores k (x-axis) when using the SVM model with 84 features and the distance scheme to predict B-cell epitopes on the SEQ194 dataset.
We use the Filtered40_BCPREDFrag dataset to generate the SVM model.
The AUC and success rate for the prediction of the B-cell epitopes on the SEQ194 dataset when using predictions from the SVM model with 84 features and the five schemes: maximum, average, median, and distance scheme with k = 10 and k = 16. We use the Filtered40_BCPREDFrag to generate the SVM model.
| Method | Success rate | AUC |
| Max scheme | 47.4% | 0.52 |
| Average scheme | 56.2% | 0.56 |
| Median scheme | 60.8% | 0.55 |
| Distance scheme (k = 10) | 58.8% | 0.57 |
| Distance scheme (k = 16) | 60.3% | 0.57 |
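The maximum, average and median schemes each reduce the scores of overlapping 20-mers to a single value per residue. The sketch below illustrates one plausible reading, assuming that every residue collects the scores of all sliding windows that cover it; that assumption, the function name `residue_scores`, and the toy window scores are not taken from the paper. The distance scheme is left out because its definition is not given in this record.

```python
from statistics import mean, median

WINDOW = 20  # length of the sliding fragments (20-mers)

def residue_scores(window_scores, chain_length, combine):
    """Combine per-window scores into one score per residue.

    window_scores[i] is the score of the 20-mer starting at residue i.
    Assumption: a residue inherits the scores of every window covering it.
    """
    if chain_length < WINDOW:
        raise ValueError("chain is shorter than one window")
    n_windows = chain_length - WINDOW + 1
    if len(window_scores) != n_windows:
        raise ValueError("expected one score per sliding window")
    per_residue = []
    for pos in range(chain_length):
        first = max(0, pos - WINDOW + 1)   # earliest window covering pos
        last = min(pos, n_windows - 1)     # latest window covering pos
        per_residue.append(combine(window_scores[first:last + 1]))
    return per_residue

# Hypothetical scores for a 22-residue chain (three windows); not paper data.
toy = [0.2, 0.8, 0.5]
by_max = residue_scores(toy, 22, max)
by_avg = residue_scores(toy, 22, mean)
by_med = residue_scores(toy, 22, median)
print(by_max[0], by_max[1], by_avg[10], by_med[10])
```

Residues near the chain termini are covered by fewer windows than interior residues, so their combined scores rest on less evidence regardless of which scheme is used.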
Comparison of the proposed BEST method with existing B-cell epitope predictors on the SEQ194 dataset.
| Category | Method | Success rate | AUC | Significance of improvement in AUC compared to BEST16 | Significance of improvement in AUC compared to BEST10 |
| Structure-based | Epitopia | 80.4% | 0.59 | unavailable | unavailable |
| | Epitopia | 73.7% | 0.57 | − | − |
| Sequence-based | ABCPred | 67.0% | 0.55 | unavailable | unavailable |
| | ABCPred | 61.9% | 0.53 | + | + |
| | BayesB | 80.9% | unavailable | unavailable | unavailable |
| | CBTOPE | 45.9% | 0.52 | + | + |
| | COBEpro | 66.9% | 0.55 | unavailable | unavailable |
| | COBEpro | 66.3% | 0.54 | + | + |
| | BEST10 | 58.8% | 0.57 | | |
| | BEST16 | 60.3% | 0.57 | | |
The methods are sorted alphabetically within each category. We evaluate the significance of differences between BEST16 (BEST10) and the other methods. We compare the corresponding AUC values in 10 paired results, each based on 100 randomly selected chains from the SEQ194 dataset, using a paired t-test; +/− means that BEST16 (BEST10) is significantly better/worse than the other method at p-value <0.05.
results from [34].
results from the Epitopia web server at http://epitopia.tau.ac.il/.
results from the ABCPred web server http://www.imtech.res.in/raghava/abcpred/.
results from the BayesB web server at http://www.immunopred.org/bayesb/index.html.
results from the CBTOPE web server at http://www.imtech.res.in/raghava/cbtope/.
results from the COBEpro web server at http://scratch.proteomics.ics.uci.edu/.
results generated using the BEST method, which is based on the SVM model (C = 1.0 and gamma = 0.001953) with 84 features generated with the Filtered40_BCPREDFrag dataset and the distance scheme with k = 16 (BEST16) and with k = 10 (BEST10).
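The significance marks in the table come from a paired t-test over 10 paired AUC values. The following sketch shows how such a mark could be derived, assuming a two-sided test at the 0.05 level with 9 degrees of freedom (critical value 2.262). The AUC lists are hypothetical placeholders, not values from the paper, and the "not significant" label is an addition for completeness since the table itself only uses + and −.

```python
import math
from statistics import mean, stdev

T_CRIT_DF9 = 2.262  # two-sided critical t value, alpha = 0.05, 9 degrees of freedom

def paired_t(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical AUCs from 10 repetitions of 100 random chains; NOT paper data.
best_auc = [0.57, 0.58, 0.56, 0.57, 0.59, 0.56, 0.58, 0.57, 0.57, 0.58]
other_auc = [0.54, 0.55, 0.54, 0.53, 0.56, 0.54, 0.55, 0.55, 0.53, 0.54]

t = paired_t(best_auc, other_auc)
if t > T_CRIT_DF9:
    mark = "+"                # first method significantly better
elif t < -T_CRIT_DF9:
    mark = "-"                # first method significantly worse
else:
    mark = "not significant"
print(f"t = {t:.2f}, mark = {mark}")
```

Pairing matters here because both methods are scored on the same random subsets of chains, so the test operates on per-repetition differences rather than on two independent samples.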
Figure 5. The average AUC values estimated using the SEQ194 dataset.
The values were calculated over the 10 repetitions using 100 randomly selected chains from the SEQ194 dataset (shown using gray bars) and the corresponding standard deviations (shown using black error bars) for the considered B-cell epitope predictors.
Figure 6. Receiver operating characteristic (ROC) curves of the considered B-cell epitope predictors on the SEQ194 dataset.
Figure 7. Receiver operating characteristic (ROC) curves of the considered B-cell epitope predictors on the SEQ19 dataset.
Figure 8. Residue epitopic propensities predicted by ABCPred, COBEpro, Epitopia and BEST for a capsid protein (UniProt ID: P16489; panel A) and an anti-repression transactivator protein (UniProt ID: P20869; panel B).
The plots also include the location of the native epitopes. The x-axis shows the protein chain and the location of the native epitopes (denoted with a black horizontal line), and the y-axis shows the values of the predicted propensities. The left y-axis gives the propensities for ABCPred, COBEpro and Epitopia, and the right y-axis those for BEST.