Dong Chen, Yanjuan Li.
Abstract
The major histocompatibility complex (MHC) is a large locus in vertebrate DNA that contains a tightly linked set of polymorphic genes encoding cell-surface proteins essential for the adaptive immune system. Because the proteins encoded in the MHC play an important role in adaptive immunity, accurate identification of the MHC is necessary to understand that role. This study establishes an effective predictor, PredMHC, to identify the MHC from protein sequences. First, PredMHC encodes a protein sequence with mixed features comprising 188D, APAAC, KSCTriad, CKSAAGP, and PAAC. Second, three classifiers (SGD, SMO, and random forest) are trained on the mixed features of the protein sequence. Finally, the prediction is obtained by voting among the three classifiers. In the 10-fold cross-validation test on the training dataset, PredMHC achieved 91.69% accuracy. Comparisons with other features, classifiers, and existing methods demonstrated the effectiveness of PredMHC in predicting the MHC.
Keywords: feature extraction; identification; machine learning; major histocompatibility complex; protein classification
Year: 2022 PMID: 35547252 PMCID: PMC9081368 DOI: 10.3389/fgene.2022.875112
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1. Framework of PredMHC.
Results of different features on Train-10790.
| Features | ACC | MCC | SE | SP |
|---|---|---|---|---|
| (1)-188D | 0.8953 | 0.7927 | 0.8596 | 0.9310 |
| (2)-APAAC | 0.8329 | 0.6824 | 0.9494 | 0.7108 |
| (3)-KSCTriad | 0.8764 | 0.7580 | 0.8177 | 0.9350 |
| (4)-CKSAAGP | 0.8682 | 0.7469 | 0.7826 | 0.9529 |
| (5)-PAAC | 0.8283 | 0.6739 | 0.9485 | 0.7018 |
| 188D + APAAC | 0.9003 | 0.8019 | 0.8735 | 0.9276 |
| APAAC + KSCTriad | 0.8872 | 0.7782 | 0.8386 | 0.9360 |
| KSCTriad + CKSAAGP | 0.8993 | 0.8039 | 0.8404 | 0.9576 |
| CKSAAGP + PAAC | 0.8848 | 0.7728 | 0.8376 | 0.9316 |
| 188D + APAAC + KSCTriad | 0.9121 | 0.8268 | 0.8734 | 0.9511 |
| APAAC + KSCTriad + CKSAAGP | 0.9054 | 0.8155 | 0.8518 | 0.9589 |
| KSCTriad + CKSAAGP + PAAC | 0.9041 | 0.8127 | 0.8516 | 0.9565 |
| 188D + APAAC + KSCTriad + CKSAAGP | 0.9157 | 0.8351 | 0.8701 | 0.9618 |
| APAAC + KSCTriad + CKSAAGP + PAAC | 0.9065 | 0.8178 | 0.8522 | 0.9608 |
| Our mixed feature | 0.9169 | 0.8370 | 0.8761 | 0.9587 |
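The column headings ACC, MCC, SE, and SP are not expanded in this record; they are read here under their standard meanings (accuracy, Matthews correlation coefficient, sensitivity, and specificity). The following is a minimal sketch of how these four values are conventionally computed from binary confusion-matrix counts. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute ACC, MCC, SE and SP from binary confusion-matrix counts.

    tp/fn count positive samples predicted correctly/incorrectly;
    tn/fp count negative samples predicted correctly/incorrectly.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    acc = (tp + tn) / total
    # Sensitivity: fraction of true positives recovered.
    se = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true negatives recovered.
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "SE": se, "SP": sp}


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print({name: round(value, 4) for name, value in example.items()})
```

Unlike ACC, MCC uses all four counts symmetrically, so it stays informative when one class is predicted much better than the other, as with the high-SE, low-SP rows for APAAC and PAAC.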
Results of different classifiers on Train-10790.
| Classifiers | ACC | MCC | SE | SP |
|---|---|---|---|---|
| SGD | 0.8794 | 0.7600 | 0.8504 | 0.9081 |
| SMO | 0.9038 | 0.8106 | 0.8594 | 0.9478 |
| Random forest | 0.8850 | 0.7699 | 0.8830 | 0.8869 |
| Our classification model | 0.9169 | 0.8370 | 0.8761 | 0.9587 |
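The abstract states that the final prediction is obtained by voting among SGD, SMO, and random forest, but this record does not say whether the vote is hard, soft, or weighted. The sketch below assumes a plain hard majority vote over binary labels; the function name `majority_vote` and the toy predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists by hard majority vote.

    predictions: list of label lists, one per classifier, all of equal
    length. With three classifiers and binary labels no ties can occur.
    """
    if not predictions:
        raise ValueError("at least one classifier is required")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must label the same samples")
    voted = []
    # zip(*predictions) yields, per sample, the labels from every classifier.
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Toy predictions (1 = MHC, 0 = non-MHC); not taken from the study.
sgd_pred = [1, 0, 1, 0]
smo_pred = [1, 1, 0, 0]
rf_pred = [0, 1, 1, 0]
print(majority_vote([sgd_pred, smo_pred, rf_pred]))
```

A vote of this kind can outperform each member when the classifiers make different errors, which is consistent with the ensemble row exceeding each individual classifier in ACC and MCC.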
Results of different features on Test-2698.
| Features | ACC | MCC | SE | SP |
|---|---|---|---|---|
| 188D | 0.8926 | 0.7869 | 0.8593 | 0.9259 |
| APAAC | 0.8357 | 0.6892 | 0.9533 | 0.7139 |
| KSCTriad | 0.8741 | 0.7504 | 0.8355 | 0.9127 |
| CKSAAGP | 0.8774 | 0.7614 | 0.8098 | 0.9442 |
| PAAC | 0.8326 | 0.6826 | 0.9527 | 0.7056 |
| 188D + APAAC | 0.9010 | 0.8061 | 0.8482 | 0.9530 |
| APAAC + KSCTriad | 0.8940 | 0.7888 | 0.8697 | 0.9182 |
| KSCTriad + CKSAAGP | 0.9055 | 0.8155 | 0.8540 | 0.9573 |
| CKSAAGP + PAAC | 0.8901 | 0.7818 | 0.8571 | 0.9230 |
| 188D + APAAC + KSCTriad | 0.9172 | 0.8355 | 0.8938 | 0.9412 |
| APAAC + KSCTriad + CKSAAGP | 0.9130 | 0.8287 | 0.8729 | 0.9532 |
| KSCTriad + CKSAAGP + PAAC | 0.9155 | 0.8337 | 0.8769 | 0.9544 |
| 188D + APAAC + KSCTriad + CKSAAGP | 0.9198 | 0.8416 | 0.8841 | 0.9550 |
| APAAC + KSCTriad + CKSAAGP + PAAC | 0.9134 | 0.8300 | 0.8693 | 0.9574 |
| Our mixed feature | 0.9246 | 0.8502 | 0.9034 | 0.9466 |
Results of different classifiers on Test-2698.
| Classifier | ACC | MCC | SE | SP |
|---|---|---|---|---|
| SGD | 0.8959 | 0.7918 | 0.8935 | 0.8982 |
| SMO | 0.9063 | 0.8147 | 0.8682 | 0.9440 |
| Random forest | 0.8948 | 0.7896 | 0.8913 | 0.8982 |
| Our classification model | 0.9246 | 0.8502 | 0.9034 | 0.9466 |
Comparison of 10-fold cross-validation with the existing method on all data.
| Method | ACC | MCC | SE | SP |
|---|---|---|---|---|
| ELM-MHC | 0.9166 | 0.822 | 0.893 | 0.908 |
| Our method | 0.9185 | 0.8403 | 0.8741 | 0.9627 |
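For readers unfamiliar with the evaluation protocol, the sketch below shows one common way to generate 10-fold cross-validation splits. This record does not say whether the folds were shuffled or stratified, so the seeded shuffle, the function name `k_fold_indices`, and the sample count in the usage lines are assumptions made for illustration only.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Every sample appears in exactly one test fold; fold sizes differ by
    at most one. Shuffling with a fixed seed is an assumption made here.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding assigns every k-th shuffled index to the same fold.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Illustrative usage with a hypothetical dataset of 25 samples.
for fold_id, (train, test) in enumerate(k_fold_indices(25, k=10)):
    print(fold_id, len(train), len(test))
```

Each model is trained on nine folds and evaluated on the held-out fold, and the reported metrics are typically averaged or pooled over the ten held-out folds.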