Ivan Dimitrov, Nevena Zaharieva, Irini Doytchinova.
Abstract
The identification of protective immunogens is the most important and vigorous initial step in the long and expensive process of vaccine design and development. Machine learning (ML) methods are very effective in data mining and in the analysis of big data such as microbial proteomes. They can significantly reduce the experimental work needed to discover novel vaccine candidates. Here, we applied six supervised ML methods (partial least squares-based discriminant analysis, k nearest neighbor (kNN), random forest (RF), support vector machine (SVM), random subspace method (RSM), and extreme gradient boosting (xgboost)) to a set of 317 known bacterial immunogens and 317 bacterial non-immunogens and derived models for immunogenicity prediction. The models were validated by internal 10-fold cross-validation on the training set and on an external test set. All of them showed good predictive ability. The xgboost model was the best at identifying immunogens, recognizing 84% of the known immunogens in the test set. The combined RSM-kNN model was the best at recognizing non-immunogens, identifying 92% of them in the test set. The three best performing ML models (xgboost, RSM-kNN, and RF) were implemented in the new version of the server VaxiJen, and the prediction of bacterial immunogens is now based on majority voting.
Keywords: immunogenicity prediction; machine learning; protective immunogens
Year: 2020 PMID: 33265930 PMCID: PMC7711804 DOI: 10.3390/vaccines8040709
Source DB: PubMed Journal: Vaccines (Basel) ISSN: 2076-393X
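The majority voting mentioned in the abstract can be illustrated with a minimal sketch. This is not the VaxiJen implementation: the function name `majority_vote` and the per-model calls below are hypothetical, and the sketch only assumes that each of the three models returns a binary label (1 for immunogen, 0 for non-immunogen).

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class label predicted by most models.

    `predictions` holds one label per model. With three binary
    classifiers a tie cannot occur.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    label, _count = Counter(predictions).most_common(1)[0]
    return label


# Hypothetical calls of the three models for one protein
# (1 = immunogen, 0 = non-immunogen); not taken from the paper.
votes = {"xgboost": 1, "RSM-kNN": 0, "RF": 1}
print(majority_vote(list(votes.values())))  # prints 1
```

With an odd number of binary voters, a protein is called an immunogen exactly when at least two of the three models agree on that label.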
Figure 1. Workflow of machine learning (ML) model development in the present study. The dataset of 317 bacterial immunogens and 317 bacterial non-immunogens was divided into training and test sets. The training set was used for model development, and the test set for validation. The proteins were encoded by E-descriptors and auto-cross covariance (ACC)-transformed into uniform vectors. The models were validated by receiver operating characteristic (ROC) statistics.
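The ACC step in the caption turns proteins of different lengths into vectors of equal length. The sketch below shows one common form of the transform, in which each feature is the average product of descriptor j at position i and descriptor k at position i + lag. The function name `acc_transform`, the `max_lag` parameter, and the toy numbers are assumptions made for illustration. Real E-descriptor values and the lag used by the authors are not given in this record and are not reproduced here.

```python
def acc_transform(descriptors, max_lag):
    """Auto- and cross-covariance (ACC) transform of a descriptor matrix.

    descriptors: one row per residue, one column per descriptor.
    Returns a flat list of length max_lag * d * d, where d is the number
    of descriptors, so the output size does not depend on protein length.
    """
    n = len(descriptors)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    d = len(descriptors[0])
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            for k in range(d):
                # j == k gives an auto-covariance term,
                # j != k gives a cross-covariance term.
                total = sum(
                    descriptors[i][j] * descriptors[i + lag][k]
                    for i in range(n - lag)
                )
                features.append(total / (n - lag))
    return features


# Toy numbers only (two descriptors per residue), not real E-descriptors.
short_protein = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
long_protein = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8], [0.2, 0.2], [0.9, -0.1]]

# Both vectors have 2 lags * 2 * 2 descriptor pairs = 8 features.
print(len(acc_transform(short_protein, 2)), len(acc_transform(long_protein, 2)))
```

Because the output length depends only on the number of descriptors and the maximum lag, the resulting vectors can be fed directly to classifiers that require fixed-size inputs.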
Summary of the performance of the machine learning (ML) models. TP—true positives; TN—true negatives; FP—false positives; FN—false negatives; AROC—area under the ROC curve (sensitivity vs. 1-specificity); APR—area under the PR curve (precision vs. recall), MCC—Matthews correlation coefficient; FS—feature selection.
| Model | TP | TN | FP | FN | Sensitivity | Specificity | Accuracy | Precision | AROC | APR | MCC | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | | | | | |
| Training set | 160 | 168 | 82 | 90 | 0.64 | 0.67 | 0.65 | 0.66 | 0.70 | 0.66 | 0.31 | 0.65 |
| Test set | 41 | 53 | 14 | 26 | 0.61 | 0.79 | 0.70 | 0.74 | 0.74 | 0.76 | 0.41 | 0.67 |
| | | | | | | | | | | | | |
| Training set | 177 | 191 | 59 | 73 | 0.71 | 0.76 | 0.74 | 0.75 | 0.82 | 0.82 | 0.47 | 0.73 |
| Test set | 47 | 53 | 14 | 20 | 0.70 | 0.79 | 0.75 | 0.77 | 0.83 | 0.84 | 0.50 | 0.73 |
| | | | | | | | | | | | | |
| Training set | 185 | 190 | 60 | 65 | 0.74 | 0.76 | 0.75 | 0.76 | 0.82 | 0.82 | 0.50 | 0.75 |
| Test set | 48 | 55 | 12 | 19 | 0.72 | 0.82 | 0.77 | 0.80 | 0.85 | 0.83 | 0.54 | 0.76 |
| | | | | | | | | | | | | |
| Training set | 191 | 181 | 69 | 59 | 0.76 | 0.72 | 0.74 | 0.74 | 0.81 | 0.81 | 0.49 | 0.75 |
| Test set | 50 | 56 | 11 | 17 | 0.75 | 0.84 | 0.79 | 0.82 | 0.83 | 0.84 | 0.58 | 0.78 |
| | | | | | | | | | | | | |
| Training set | 174 | 199 | 51 | 76 | 0.70 | 0.80 | 0.75 | 0.77 | 0.75 | 0.69 | 0.49 | 0.73 |
| Test set | 49 | 56 | 11 | 18 | 0.73 | 0.84 | 0.78 | 0.82 | 0.78 | 0.73 | 0.57 | 0.77 |
| RSM-kNN | | | | | | | | | | | | |
| Training set | 190 | 198 | 52 | 60 | 0.76 | 0.79 | 0.78 | 0.78 | 0.85 | 0.87 | 0.55 | 0.77 |
| Test set | 48 | 62 | 5 | 19 | 0.72 | 0.92 | 0.82 | 0.91 | 0.88 | 0.89 | 0.66 | 0.80 |
| xgboost | | | | | | | | | | | | |
| Training set | 178 | 179 | 71 | 72 | 0.71 | 0.72 | 0.71 | 0.72 | 0.79 | 0.80 | 0.43 | 0.71 |
| Test set | 56 | 50 | 17 | 11 | 0.84 | 0.75 | 0.79 | 0.77 | 0.86 | 0.88 | 0.58 | 0.80 |
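
Each row of the performance table lists four confusion-matrix counts followed by ratio measures. The sketch below recomputes the count-derived measures from the standard definitions, assuming the counts appear in the order TP, TN, FP, FN. The function name `confusion_metrics` is hypothetical. AROC and APR are not included because they require the ranked prediction scores, which this record does not provide.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Compute standard binary-classification measures from raw counts."""
    sensitivity = tp / (tp + fn)            # recall on immunogens
    specificity = tn / (tn + fp)            # recall on non-immunogens
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Counts from the test-set row with 92% specificity (TP=48, TN=62, FP=5, FN=19).
for name, value in confusion_metrics(48, 62, 5, 19).items():
    print(f"{name}: {value:.3f}")
```

For this row the recomputed values agree with the tabulated ones to within about 0.01. The small differences suggest that the source does not round every entry in the same way.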
Performance measures for VaxiJen v2.0 and VaxiJen v3.0.
| Performance measure | VaxiJen v2.0 | VaxiJen v3.0 |
|---|---|---|
| | 27,055 | 27,055 |
| | 17,256 | 4,825 |
| | 63.78 | 17.83 |
| | 76 | 80 |
| | 1.2 | 4.5 |
| | 0.00611 | 8.33 × 10⁻⁴² |