Cristian R Munteanu, Marcos Gestal, Yunuen G Martínez-Acevedo, Nieves Pedreira, Alejandro Pazos, Julián Dorado.
Abstract
In this work, we improved a previous model used for the prediction of proteomes as new B-cell epitopes in vaccine design. The predicted epitope activity of a queried peptide is based on its sequence and on a known reference epitope sequence under specific experimental conditions. The peptide sequences were transformed into molecular descriptors of sequence recurrence networks, which were then combined with the experimental conditions. The new models were generated using 709,100 instances of pair descriptors for query and reference peptide sequences. Using perturbations of the initial descriptors under sequence or assay conditions, 10 transformed features were used as inputs for seven machine learning methods. The best model was obtained with random forest classifiers, with an area under the receiver operating characteristic curve (AUROC) of 0.981 ± 0.0005 for the external validation series (five-fold cross-validation). The database included information about 83,683 peptide sequences, 1,448 epitope organisms, 323 host organisms, 15 types of in vivo processes, 28 experimental techniques, and 505 adjuvant additives. The current model could improve the in silico predictions of epitopes for vaccine design. The script and results are available in a free repository.
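For readers unfamiliar with the metric, AUROC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The following is a minimal sketch of that definition only. The function name `auroc` and the toy labels and scores are illustrative assumptions and are not taken from the paper's repository.

```python
def auroc(labels, scores):
    """AUROC by pairwise comparison of positive and negative scores.

    Hypothetical helper for illustration; quadratic in the number of
    instances, so suitable only for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (assumption): 1 = epitope, 0 = non-epitope.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auroc(labels, scores))
```

In the toy data, eight of the nine positive-negative pairs are ordered correctly, so the function returns 8/9. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.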
Keywords: epitopes; machine learning; protein sequences; qualitative structure–activity relationships
Year: 2019 PMID: 31491969 PMCID: PMC6770149 DOI: 10.3390/ijms20184362
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Area Under the Receiver Operating Characteristics (AUROC) values for seven Machine Learning (ML) methods (five-fold cross-validation (CV)).
| ML | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean | SD |
|---|---|---|---|---|---|---|---|
| KNN | 0.900 | 0.898 | 0.900 | 0.898 | 0.897 | 0.899 | 0.0013 |
| SVM linear | 0.785 | 0.800 | 0.789 | 0.799 | 0.790 | 0.792 | 0.0067 |
| SVM | 0.866 | 0.863 | 0.864 | 0.866 | 0.862 | 0.864 | 0.0019 |
| LR | 0.818 | 0.816 | 0.816 | 0.816 | 0.814 | 0.816 | 0.0015 |
| DT | 0.923 | 0.923 | 0.923 | 0.923 | 0.923 | 0.923 | 0.0003 |
| RF | 0.974 | 0.973 | 0.973 | 0.972 | 0.971 | | |
| XGB | 0.892 | 0.890 | 0.890 | 0.889 | 0.887 | 0.890 | 0.0017 |
ML = Machine Learning; SD = standard deviation; KNN = KNeighborsClassifier; SVM linear = SVC(kernel="linear"); SVM = SVC(kernel="rbf"); LR = LogisticRegression; DT = DecisionTreeClassifier; RF = RandomForestClassifier; XGB = XGBClassifier. The best AUROC value and the corresponding SD are bolded.
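The Mean and SD columns can be cross-checked from the per-fold values listed in the table. The sketch below assumes the sample standard deviation (n − 1 denominator). Because the fold values shown here are rounded to three decimals, recomputed figures may differ from the published ones in the last digit. The dictionary `folds` and the helper `summarize` are illustrative names.

```python
from statistics import mean, stdev

# Per-fold AUROC values copied from the table above.
folds = {
    "KNN": [0.900, 0.898, 0.900, 0.898, 0.897],
    "RF": [0.974, 0.973, 0.973, 0.972, 0.971],
}


def summarize(values):
    """Return (mean, sample SD) for one row of per-fold AUROC values."""
    return mean(values), stdev(values)  # stdev uses the n - 1 denominator


summary = {name: summarize(values) for name, values in folds.items()}

for name, (m, sd) in summary.items():
    print(f"{name}: mean = {m:.3f}, SD = {sd:.4f}")
```

For the KNN row this reproduces the tabulated 0.899 and 0.0013. Applied to the rounded RF folds it gives a mean of 0.973 and an SD of 0.0011; these are recomputed values, not figures quoted from the paper.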
Figure 1. Box-plot for AUROC values of ML classifiers (five-fold CV).
Figure 2. Box-plot for AUROC values of RF classifiers with different numbers of trees (five-fold CV). RFn = Random Forest with n trees (n = 5, 10, 20, 30, 40, 50, 100, 200, 500, 1000).
Figure 3. Feature importance for the best RF classifier.
Figure 4. Methodology flow for building models to predict epitope activity level.
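The evaluation protocol behind the table and Figures 1 and 2 (five-fold CV scored by AUROC, with RF evaluated at several forest sizes) could be sketched with scikit-learn as follows. This is not the authors' script. The data are synthetic (`make_classification` with 10 features, standing in for the 10 transformed features), and the sample size, the subset of tree counts, the stratified shuffled folds, and the random seeds are all assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real descriptor matrix (assumption, not paper data).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, random_state=0
)

# Five-fold CV; stratification and shuffling are assumptions of this sketch.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for n_trees in (5, 10, 20, 50):  # a short subset of the forest sizes in Figure 2
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    # Mean and sample SD over the five folds, as in the table's last columns.
    results[n_trees] = (scores.mean(), scores.std(ddof=1))

for n_trees, (m, sd) in results.items():
    print(f"RF{n_trees}: AUROC = {m:.3f} ± {sd:.4f}")
```

The numbers printed by this sketch describe the synthetic data only and are not comparable to the values reported for the epitope dataset.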