Rui-Si Hu, Jin Wu, Lichao Zhang, Xun Zhou, Ying Zhang.
Abstract
Computational screening of potential vaccine candidates has proven to be a reliable aid to vaccine discovery for infectious diseases. Pathogenic eukaryotes (such as parasitic protozoans) are an important class of disease-causing organisms that have evolved the ability to colonize a wide range of hosts, including humans and animals, and protective vaccines against them are urgently needed. Inspired by the immunological idea that pathogen-derived epitopes can mediate the CD8+ T-cell-related host adaptive immune response, and using the available positive and negative CD8+ T-cell epitopes (TCEs), we proposed a novel predictor called CD8TCEI-EukPath to detect CD8+ TCEs of eukaryotic pathogens. Our method integrated multiple amino acid sequence-based hybrid features, employed a well-established feature selection technique, and built a machine learning classifier to differentiate CD8+ TCEs from non-CD8+ TCEs. Based on the feature selection results, 520 optimal hybrid features were used to build the model with the LightGBM algorithm. CD8TCEI-EukPath achieved an accuracy of 79.255% in ten-fold cross-validation and 78.169% in the independent test. Collectively, CD8TCEI-EukPath will contribute to rapidly screening epitope-based vaccine candidates, particularly from large peptide-coding datasets. For convenient prediction of CD8+ TCEs, an online web server is freely accessible (http://lab.malab.cn/~hrs/CD8TCEI-EukPath/).
Keywords: LightGBM; T-cell epitopes; eukaryotic pathogens; hybrid features; machine learning
Year: 2022; PMID: 35937988; PMCID: PMC9354802; DOI: 10.3389/fgene.2022.935989
Source DB: PubMed; Journal: Front Genet; ISSN: 1664-8021; Impact factor: 4.772
FIGURE 1. A technology roadmap of the machine learning model proposed in this study.
FIGURE 2. Analysis of amino acid sequence features. (A) Length distribution of the positive CD8+ T-cell epitopes. The horizontal axis represents the epitope length in amino acids, and the vertical axis represents the number of epitopes in positive samples. (B) Distribution of amino acid types in the positive and negative CD8+ T-cell epitopes. The horizontal axis represents the twenty amino acids, and the vertical axis represents the occurrence frequency of an amino acid in all sequences.
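The per-residue frequency plotted in Figure 2B is the same quantity that the AAC (amino acid composition) descriptor encodes for a single peptide. The following is a minimal sketch of that calculation, not the authors' code: the function name `amino_acid_composition` is hypothetical, and the example peptide `SIINFEKL` is a generic illustrative sequence that is not taken from this study's dataset.

```python
from collections import Counter

# The twenty standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(peptide: str) -> dict:
    """Return the fraction of each standard amino acid in a peptide.

    Hypothetical helper for illustration. Non-standard characters are
    ignored when computing the denominator.
    """
    counts = Counter(peptide.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("peptide contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Illustrative peptide (an assumption, not from the paper's data).
aac = amino_acid_composition("SIINFEKL")
for aa, freq in aac.items():
    if freq > 0:
        print(f"{aa}: {freq:.3f}")
```

Each peptide maps to a fixed-length vector of 20 values that sum to 1, which is what lets epitopes of different lengths share one feature space.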
The accuracy (Acc, %) results of a single feature descriptor classified by different machine learning algorithms.
| Evaluation | Feature descriptor | Bagging | DT | KNN | LGBM | LR | NB | RF | SVM |
|---|---|---|---|---|---|---|---|---|---|
| Ten-fold cross-validation | AAC | 71.454 | 66.667 | 68.174 |  | 65.160 | 65.603 |  | 67.730 |
| Ten-fold cross-validation | ASDC | 72.252 | 65.160 | 67.642 |  | 66.223 | 66.755 | 74.468 | 74.291 |
| Ten-fold cross-validation | CTDC | 67.199 | 62.145 | 66.933 |  | 64.628 | 61.259 | 68.351 | 68.351 |
| Ten-fold cross-validation | CTDT | 66.667 | 60.638 | 64.805 |  | 63.032 | 60.372 | 67.908 | 66.401 |
| Ten-fold cross-validation | CTDD | 71.986 | 64.982 | 69.326 |  | 66.312 | 66.401 | 72.606 | 68.883 |
| Ten-fold cross-validation | GDPC | 64.894 | 59.309 | 60.638 | 65.514 | 59.929 | 57.624 |  | 63.209 |
| Ten-fold cross-validation | GTPC | 67.287 | 62.057 | 63.564 | 68.174 | 60.284 | 59.663 |  | 65.071 |
| Independent test | AAC |  | 67.606 | 73.592 | 76.056 | 66.901 | 65.493 | 74.648 | 67.606 |
| Independent test | ASDC | 75.352 | 65.845 | 69.366 |  | 67.254 | 69.014 | 75.704 | 74.296 |
| Independent test | CTDC | 72.535 | 63.028 | 66.197 | 70.070 | 67.254 | 63.380 |  | 71.479 |
| Independent test | CTDT | 65.493 | 60.915 | 65.141 | 65.141 | 66.197 | 60.211 |  | 64.085 |
| Independent test | CTDD | 69.014 | 62.324 | 66.197 | 72.183 | 68.662 | 66.901 |  | 70.775 |
| Independent test | GDPC | 63.028 | 50.704 | 59.859 | 64.789 | 61.268 | 61.972 |  | 65.141 |
| Independent test | GTPC | 66.197 | 59.859 | 65.141 |  | 63.380 | 63.028 | 68.662 | 66.197 |
The best Acc values to reflect the performance of different classifiers were highlighted in bold font.
The classification results of different hybrid feature combinations detected by the LGBM classifier.
| Hybrid features | Ten-fold CV Acc (%) | Ten-fold CV MCC | Ten-fold CV Se (%) | Ten-fold CV Sp (%) | Independent test Acc (%) | Independent test MCC | Independent test Se (%) | Independent test Sp (%) |
|---|---|---|---|---|---|---|---|---|
| AAC + ASDC + CCTD + GDTPC |  |  |  |  |  |  |  | 77.465 |
| ASDC + CCTD + GDTPC | 78.103 | 0.562 | 76.596 | 79.610 | 76.056 | 0.521 | 77.465 | 74.648 |
| AAC + ASDC + CCTD | 77.660 | 0.553 | 76.064 | 79.255 | 77.113 | 0.542 | 76.056 | 78.169 |
| AAC + CCTD + GDTPC | 77.305 | 0.546 | 76.596 | 78.014 | 75.352 | 0.507 | 74.648 | 76.056 |
| ASDC + CCTD | 77.482 | 0.550 | 76.950 | 78.014 | 76.761 | 0.535 | 78.169 | 75.352 |
| AAC + CCTD | 76.684 | 0.534 | 75.887 | 77.482 | 75.352 | 0.507 | 76.056 | 74.648 |
| ASDC + GDTPC | 76.152 | 0.523 | 74.468 | 77.837 | 75.704 | 0.515 | 72.535 |  |
| CCTD + GDTPC | 76.064 | 0.522 | 74.468 | 77.660 | 77.465 | 0.550 |  | 76.056 |
| AAC + ASDC | 75.621 | 0.512 | 75.000 | 76.241 | 72.535 | 0.451 | 70.423 | 74.648 |
| AAC + GDTPC | 74.911 | 0.499 | 73.227 | 76.596 | 77.817 | 0.556 | 76.761 |  |
| CCTD | 75.621 | 0.513 | 74.645 | 76.596 | 74.648 | 0.495 |  | 70.423 |
| GDTPC | 69.592 | 0.392 | 68.262 | 70.922 | 72.535 | 0.451 | 71.831 | 73.239 |
The best metric values were highlighted in bold font.
A comparison of classification results by a pairwise combination of two feature selection techniques (MRMD and LGBM) and three optimal classifiers (LGBM, RF, and Bagging).
| Method | Ten-fold CV Acc (%) | Ten-fold CV MCC | Ten-fold CV Se (%) | Ten-fold CV Sp (%) | Independent test Acc (%) | Independent test MCC | Independent test Se (%) | Independent test Sp (%) |
|---|---|---|---|---|---|---|---|---|
| MRMD + LGBM |  |  | 77.837 |  |  |  |  |  |
| LGBM + LGBM | 78.457 | 0.569 |  | 78.901 | 77.113 | 0.542 | 77.465 | 76.761 |
| MRMD + RF | 75.887 | 0.518 | 73.404 | 78.369 | 74.648 | 0.493 | 72.535 | 76.761 |
| LGBM + RF | 75.355 | 0.507 | 74.645 | 76.064 | 74.648 | 0.493 | 73.944 | 75.352 |
| MRMD + Bagging | 73.316 | 0.466 | 72.163 | 74.468 | 75.352 | 0.507 | 73.239 | 77.465 |
| LGBM + Bagging | 74.202 | 0.484 | 72.695 | 75.709 | 75.704 | 0.514 | 76.056 | 75.352 |
The best metric values were highlighted in bold font.
FIGURE 3. A comparison of the AUC curves in ten-fold cross-validation (A) and the independent test (B). Results were obtained with pairwise combinations of two feature selection techniques (MRMD and LGBM) and three optimal classifiers (LGBM, RF, and Bagging).
FIGURE 4. The optimal feature sets selected by LGBM feature importance ranking (A) and a well-established MRMD strategy (B). The horizontal axis represents the number of selected features, and the vertical axis represents the accuracy value calculated by the LGBM classifier.
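A curve of accuracy against the number of selected features, as plotted in Figure 4, can be produced by scanning prefixes of a ranked feature list and scoring each prefix. The sketch below illustrates that scan in general terms and is not the authors' procedure: `sweep_feature_counts`, the `step` size, and the `evaluate` callback are all hypothetical, and in practice `evaluate` would wrap a cross-validated classifier rather than the toy scoring function used here.

```python
from typing import Callable, List, Sequence, Tuple


def sweep_feature_counts(
    ranked_features: Sequence,
    evaluate: Callable[[Sequence], float],
    step: int = 10,
) -> Tuple[Tuple[int, float], List[Tuple[int, float]]]:
    """Score the top-k ranked features for increasing k.

    Returns the best (k, accuracy) pair and the full curve. On ties the
    smallest k wins, because max() keeps the first maximum it sees.
    """
    n = len(ranked_features)
    ks = list(range(step, n + 1, step))
    if not ks or ks[-1] != n:
        ks.append(n)  # always score the complete feature set as well
    curve = [(k, evaluate(ranked_features[:k])) for k in ks]
    best = max(curve, key=lambda point: point[1])
    return best, curve


# Toy stand-in for a real accuracy estimate (an assumption for the demo):
# the score peaks when exactly 30 features are used.
def toy_accuracy(features: Sequence) -> float:
    return 1.0 - abs(len(features) - 30) / 100


ranked = list(range(45))  # placeholder feature indices, best first
best, curve = sweep_feature_counts(ranked, toy_accuracy, step=10)
print("best k:", best[0], "accuracy:", best[1])
```

Preferring the smallest k on ties keeps the selected feature set compact when several prefix sizes score equally well.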