Jayvee R Abella, Dinler A Antunes, Cecilia Clementi, Lydia E Kavraki.
Abstract
Prediction of stable peptide binding to Class I HLAs is an important component of designing immunotherapies. While the best performing predictors are based on machine learning algorithms trained on peptide-HLA (pHLA) sequences, the use of structure for training predictors deserves further exploration. Given enough pHLA structures, a predictor based on the residue-residue interactions found in these structures has the potential to generalize to alleles with little or no experimental data. We have previously developed APE-Gen, a modeling approach able to produce pHLA structures in a scalable manner. In this work we use APE-Gen to model over 150,000 pHLA structures, the largest dataset of its kind, which were used to train a structure-based pan-allele model. We extract simple, homogeneous features based on residue-residue distances between peptide and HLA, and build a random forest model for predicting stable pHLA binding. Our model achieves competitive AUROC values on leave-one-allele-out validation tests while using significantly less data than popular sequence-based methods. Additionally, our model offers an interpretation analysis that can reveal how the model composes the features to arrive at any given prediction. This interpretation analysis can be used to check whether the model is in line with chemical intuition, and we showcase particular examples. Our work is a significant step toward using structure to achieve generalizable and more interpretable prediction of stable pHLA binding.
Keywords: HLA-I; antigen presentation; docking; immunopeptidomics; machine learning; peptide binding; random forests; structural modeling
Year: 2020 PMID: 32793224 PMCID: PMC7387700 DOI: 10.3389/fimmu.2020.01583
Source DB: PubMed Journal: Front Immunol ISSN: 1664-3224 Impact factor: 7.561
Figure 1. Overview of the method to go from sequence to structure-based features for classification. APE-Gen is used to model pHLA structures, then featurization is done by extracting the residue-residue interactions between peptide and HLA. The final random forest model is trained on these structure-based features.
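The featurization step in this caption can be illustrated with a minimal sketch. It assumes each residue is reduced to a single representative coordinate and that a feature is the sum of a sigmoid-transformed distance over every peptide-HLA residue pair of a given type pair. The function names, the `midpoint` and `steepness` values, and the toy coordinates are all hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def sigmoid_contact(distance, midpoint=5.0, steepness=1.0):
    """Soft contact score in (0, 1): near 1 when close, near 0 when far.

    midpoint and steepness are hypothetical values, not taken from the paper.
    """
    return 1.0 / (1.0 + math.exp(steepness * (distance - midpoint)))


def featurize(peptide, hla, transform=sigmoid_contact):
    """Sum transformed peptide-HLA distances per residue-type pair.

    peptide, hla: lists of (residue_type, (x, y, z)) tuples, one per residue.
    Returns {(peptide_type, hla_type): summed score}.
    """
    features = defaultdict(float)
    for p_type, p_xyz in peptide:
        for h_type, h_xyz in hla:
            features[(p_type, h_type)] += transform(math.dist(p_xyz, h_xyz))
    return dict(features)


# Toy input: two peptide residues and two HLA residues (made-up coordinates).
peptide = [("L", (0.0, 0.0, 0.0)), ("V", (3.8, 0.0, 0.0))]
hla = [("Y", (0.0, 5.0, 0.0)), ("W", (3.8, 20.0, 0.0))]
features = featurize(peptide, hla)
```

To feed a classifier, the sparse dictionary would still need to be laid out in a fixed order over all residue-type pairs so that every structure yields a vector of the same length.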
Average AUROC values from five-fold validation tests across different classifiers and different featurizations.
| Classifier | Featurization | AUROC (SD) |
| --- | --- | --- |
| rf | 1/ | 0.978 (0.000) |
| rf | 1/ | 0.976 (0.001) |
| rf | sig | 0.975 (0.001) |
| gb | 1/ | 0.970 (0.002) |
| gb | 1/ | 0.970 (0.001) |
| gb | sig | 0.977 (0.001) |
| lr | 1/ | 0.875 (0.003) |
| lr | 1/ | 0.880 (0.002) |
| lr | sig | 0.882 (0.001) |
Only the best parameters per classifier are shown. rf stands for random forest, gb stands for gradient boosting, and lr stands for logistic regression. Average AUROC values are reported along with standard deviations. Random forest classifiers produce the most robust models.
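How a mean AUROC and its standard deviation arise from a five-fold test can be sketched in plain Python. This is an illustrative sketch, not the authors' pipeline: `cross_validate`, the round-robin fold assignment, the use of the sample standard deviation, and the toy scorer are all assumptions made for the example.

```python
import statistics


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_validate(X, y, fit_predict, k=5):
    """Return (mean, sample standard deviation) of AUROC over k folds.

    fit_predict(X_train, y_train, X_test) must return one score per test item.
    Sample i goes to fold i % k; a real run would shuffle or stratify instead.
    """
    results = []
    for fold in range(k):
        test_idx = [i for i in range(len(X)) if i % k == fold]
        train_idx = [i for i in range(len(X)) if i % k != fold]
        scores = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        results.append(auroc([y[i] for i in test_idx], scores))
    return statistics.mean(results), statistics.stdev(results)


# Toy data: a single feature that separates the classes perfectly.
X = [0.1 * i for i in range(20)]
y = [0 if i < 10 else 1 for i in range(20)]
# Hypothetical scorer that ignores the training data and returns the feature.
mean_auc, sd_auc = cross_validate(X, y, lambda X_train, y_train, X_test: X_test)
```

In practice `fit_predict` would wrap a trained random forest, gradient boosting, or logistic regression model, and the reported pair corresponds to `(mean_auc, sd_auc)`.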
Average AUROC values from five-fold validation tests across different featurizations and different datasets.
| Featurization | Dataset | AUROC (SD) |
| --- | --- | --- |
| 1/ | Single | 0.978 (0.000) |
| 1/ | Single | 0.976 (0.001) |
| sig | Single | 0.975 (0.001) |
| 1/ | Ensemble | 0.987 (0.001) |
| 1/ | Ensemble | 0.988 (0.000) |
| sig | Ensemble | 0.990 (0.000) |
Average AUROC values are reported along with standard deviations. The best model uses sigmoid-based features trained on the ensemble-enriched dataset.
Figure 2. Comparison of AUROC values for leave-one-allele-out experiments. Our structural method, which uses the sigmoid featurization and the ensemble-enriched dataset, achieves results competitive with sequence-based methods.
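A leave-one-allele-out experiment holds out every sample of one allele for testing and trains on the remaining alleles. The splitting logic can be sketched as follows; `leave_one_allele_out` is a hypothetical helper and the allele list is a made-up toy example, not data from the study.

```python
def leave_one_allele_out(alleles):
    """Yield (held_out_allele, train_indices, test_indices) for each allele.

    alleles: one allele name per sample, in dataset order.
    """
    for held_out in sorted(set(alleles)):
        train = [i for i, a in enumerate(alleles) if a != held_out]
        test = [i for i, a in enumerate(alleles) if a == held_out]
        yield held_out, train, test


# Toy dataset: four samples spread over three alleles.
alleles = ["HLA-A*02:01", "HLA-A*02:01", "HLA-B*07:02", "HLA-A*01:01"]
splits = list(leave_one_allele_out(alleles))

for held_out, train, test in splits:
    print(f"{held_out}: train on {len(train)} samples, test on {len(test)}")
```

Because the model never sees the held-out allele during training, the AUROC on each test split indicates how well it generalizes to an allele with no training data. scikit-learn's `LeaveOneGroupOut` produces the same kind of split when the allele names are passed as groups.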