Songbo Liu, Chengmin Cui, Huipeng Chen, Tong Liu.
Abstract
Phages recognize their hosts with high specificity. As natural enemies of bacteria, they have been used many times against super bacteria. Identifying phage proteins from the original sequences is important for understanding the relationship between phages and their host bacteria and for developing new antimicrobial agents. However, traditional experimental methods are both expensive and time-consuming. In this study, we propose an ensemble learning-based feature selection method to find the features that matter most for phage protein identification. The method uses four types of protein sequence-derived features, quantifies the importance of each feature by perturbing it and measuring the effect on the classification results, and finally concatenates the important features from the four types. In addition, we analyze the selected features and their biological significance.
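The perturbation idea in the abstract can be illustrated with a small sketch. The abstract does not say which perturbation or which classifiers the authors use, so this example swaps in two assumptions: the perturbation is a random shuffle of one feature column (permutation importance), and the model is a toy nearest-centroid classifier. All function names and the toy data are hypothetical.

```python
import random

def fit_centroids(X, y):
    """Toy nearest-centroid classifier: one mean vector per class label."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(centroids, row):
    """Return the label whose centroid is closest to `row`."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[c]))
    return min(centroids, key=dist)

def accuracy(centroids, X, y):
    hits = sum(predict(centroids, r) == t for r, t in zip(X, y))
    return hits / len(y)

def perturbation_importance(centroids, X, y, repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled (assumed perturbation)."""
    rng = random.Random(seed)
    base = accuracy(centroids, X, y)
    scores = []
    for j in range(len(X[0])):
        drop = 0.0
        for _ in range(repeats):
            col = [r[j] for r in X]
            rng.shuffle(col)
            perturbed = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]
            drop += base - accuracy(centroids, perturbed, y)
        scores.append(drop / repeats)
    return scores

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
rng = random.Random(1)
y = [i % 2 for i in range(200)]
X = [[label * 2.0 + rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for label in y]

model = fit_centroids(X, y)
importance = perturbation_importance(model, X, y)
print(importance)
```

A feature whose perturbation barely changes the accuracy receives a score near zero, which is what makes the scores usable for ranking. For brevity the sketch scores the model on the data it was fitted to; a held-out split would be the more careful choice.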
Keywords: ensemble learning; feature selection; machine learning; phage; protein classification
Year: 2022 PMID: 35910662 PMCID: PMC9335128 DOI: 10.3389/fmicb.2022.932661
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 6.064
Figure 1. Flowchart of the proposed method. (A) Feature extraction: AAC, CTD, CKSAAP-1GAP, and RPSSM features are extracted from the original protein sequences. (B) All samples are resampled, and the importance of each feature component is calculated using the ensemble classification. (C) A subset of important features is selected using the incremental method. (D) Phage proteins in the testing set are predicted using the important feature components.
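Two of the descriptors named in panel (A) are simple enough to sketch directly. The example below assumes the usual definitions: AAC is the fraction of each of the 20 standard amino acids, and CKSAAP with one gap is the fraction of each ordered residue pair separated by exactly one position, which yields the 20D and 400D sizes reported in the tables. The function names and the toy sequence are hypothetical, and the authors' exact normalization may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs

def aac(seq):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    residues = [c for c in seq.upper() if c in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]

def cksaap(seq, gap=1):
    """Fraction of each residue pair separated by `gap` positions (400 values)."""
    seq = seq.upper()
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(seq) - gap - 1):
        pair = seq[i] + seq[i + gap + 1]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]

toy = "ACDA"  # hypothetical toy sequence
aac_vec = aac(toy)
ck_vec = cksaap(toy, gap=1)
print(len(aac_vec), len(ck_vec))  # prints: 20 400
```

For the toy sequence, the only one-gap pairs are `AD` and `CA`, so each receives a value of 0.5 and every other entry is zero.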
Figure 2. Scatterplots of feature importance using different feature extraction methods.
Figure 3. Accuracies derived from the incremental strategy using different feature extraction methods (the colored lines show the accuracy on each of the 10 folds).
Figure 4. Accuracies derived from the incremental strategy using integrated features.
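The incremental strategy behind these accuracy curves is not spelled out in this record. A common form, assumed here, ranks the features by importance, adds them one at a time, scores each growing subset, and keeps the subset with the best score. In the sketch below, `evaluate` stands in for a cross-validated accuracy and is a hypothetical toy function, as are the importance values.

```python
def incremental_selection(importance, evaluate):
    """Add features in order of decreasing importance; keep the best-scoring subset."""
    order = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
    best_subset, best_score = [], float("-inf")
    history = []  # score after adding the 1st, 2nd, ... ranked feature
    for k in range(1, len(order) + 1):
        subset = order[:k]
        score = evaluate(subset)
        history.append(score)
        if score > best_score:
            best_subset, best_score = list(subset), score
    return best_subset, best_score, history

# Hypothetical scorer: features 0 and 3 help, every other feature hurts slightly.
USEFUL = {0, 3}

def evaluate(subset):
    helpful = len(USEFUL & set(subset))
    useless = len(set(subset) - USEFUL)
    return 0.5 + 0.2 * helpful - 0.01 * useless

importance = [0.9, 0.1, 0.05, 0.7]  # hypothetical importance scores
subset, score, history = incremental_selection(importance, evaluate)
print(subset)  # prints: [0, 3]
```

Plotting `history` against the number of features gives a curve of the kind the captions describe: accuracy rises while informative features are added and flattens or falls once the remaining features contribute mostly noise.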
Average accuracy of a 10-fold cross-validation on the training set using different features.
| Classifiers | Features | Sn (%) | Sp (%) | Acc (%) |
|---|---|---|---|---|
| DTC, GNB, LR, MLP | AAC (20D) | 71.89 | 85.02 | 80.81 |
| MLP, KNN, DTC, LR | CTD (168D) | 69.00 | 86.12 | 80.41 |
| KNN, MLP, LR, GNB | CKSAAP_1gap (400D) | 70.78 | 80.71 | 77.51 |
| GNB, LDA, LR, MLP | RPSSM (110D) | 82.78 | 79.81 | 80.81 |
| MLP, GNB, LR, DTC, KNN | Concatenation (698D) | 56.67 | 90.63 | 79.79 |
| GNB, MLP, KNN, LR | | | | |
| GNB, KNN | | | | |
| LR, MLP, DTC, GNB, LDA | | | | |
| DTC, LR, MLP, LDA, KNN | | | | |
| GNB, LDA, LR | | | | |
Bold indicates the result of the processing of the features.
Accuracy on independent test sets using different kinds of features.
| Classifier | Features | Sn (%) | Sp (%) | Acc (%) |
|---|---|---|---|---|
| MLP | AAC (20D) | 66.67 | 85.94 | 79.79 |
| MLP | CTD (168D) | 70.00 | 79.69 | 76.60 |
| GNB | CKSAAP_1gap (400D) | 63.33 | 85.94 | 78.72 |
| GNB | RPSSM (110D) | 73.33 | 76.56 | 75.53 |
| | | 56.67 | 90.63 | 79.79 |
Bold indicates the result of the processing of the features.
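The three numeric columns behave like sensitivity (Sn), specificity (Sp), and overall accuracy (Acc), and a quick calculation shows how they relate. The sketch below makes two assumptions that are not stated in this record: that the columns are indeed Sn, Sp, and Acc, and that the independent test set holds 30 positive and 64 negative proteins. The counts are hypothetical, chosen because they reproduce the AAC (20D) row of the table.

```python
def rates(tp, fn, tn, fp):
    """Return (Sn, Sp, Acc) in percent from confusion-matrix counts."""
    sn = 100.0 * tp / (tp + fn)                      # correct among positives
    sp = 100.0 * tn / (tn + fp)                      # correct among negatives
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)    # correct overall
    return sn, sp, acc

# Hypothetical counts: 20 of 30 positives and 55 of 64 negatives classified correctly.
sn, sp, acc = rates(tp=20, fn=10, tn=55, fp=9)
print(f"Sn={sn:.2f}  Sp={sp:.2f}  Acc={acc:.2f}")
# prints: Sn=66.67  Sp=85.94  Acc=79.79
```

Because Acc is the average of Sn and Sp weighted by class size, two rows can share the same accuracy while trading sensitivity against specificity, as the 66.67/85.94 and 56.67/90.63 rows do.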
Figure 5. ROC curves of 87 features on independent test data.
Performance comparison of the different features in independent test sets.
| Method | Source or features | Sn (%) | Sp (%) | Acc (%) |
|---|---|---|---|---|
| Naïve Bayes | Ding et al., | 75.76 | 80.77 | 79.15 |
| SVM | Ding et al., | 75.76 | 89.42 | 85.02 |
| Bin et al., | Nine feature groups (8D) | 50.00 | 92.19 | 78.72 |
Bold indicates the result of the processing of the features.