Jiu-Xin Tan, Fu-Ying Dao, Hao Lv, Peng-Mian Feng, Hui Ding.
Abstract
Accurate identification of phage virion proteins is a key step toward understanding their function and also helps clarify the mechanism by which phages lyse bacterial cells. Because traditional experimental methods for identifying phage virion proteins are time-consuming and costly, machine learning methods that identify them accurately and efficiently are urgently needed. In this work, a support vector machine (SVM) based method is proposed that combines multiple sets of optimal g-gap dipeptide compositions. Analysis of variance (ANOVA) and minimal-redundancy-maximal-relevance (mRMR) with incremental feature selection (IFS) were applied to select the optimal feature set. In a five-fold cross-validation test, the proposed method achieved an overall accuracy of 87.95%. We believe that the proposed method will be an efficient and powerful tool for scientists studying phage virion proteins.
Keywords: ANOVA; feature fusion; mRMR; machine learning; phage virion protein
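The abstract names the g-gap dipeptide composition as the underlying feature but does not define it here. The following is a minimal sketch under the commonly used definition (an assumption, since the excerpt gives no formula): for a sequence of length L, each of the 400 residue pairs separated by g intervening residues is counted and divided by L - g - 1. The function name `g_gap_dipeptide_composition` and the toy sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def g_gap_dipeptide_composition(seq, g):
    """Return a 400-element frequency vector for residue pairs with g residues between them.

    Assumed definition: frequency = count / (L - g - 1). Pairs that contain a
    non-standard residue are not counted, but the denominator is unchanged.
    """
    total = len(seq) - g - 1  # number of (i, i + g + 1) positions in the sequence
    if total <= 0:
        raise ValueError("sequence is too short for this gap")
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(total):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] / total for p in PAIRS]


# Toy sequence (hypothetical): g = 0 gives adjacent dipeptides, g = 1 skips one residue.
vec0 = g_gap_dipeptide_composition("ACAC", 0)
vec1 = g_gap_dipeptide_composition("ACAC", 1)
```

Calling the function for g = 0 through 9 yields ten 400-dimensional vectors per protein, which matches the range of gaps examined in the IFS curves.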
Year: 2018 PMID: 30103458 PMCID: PMC6222849 DOI: 10.3390/molecules23082000
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. The framework of the proposed method.
Figure 2. A plot showing the IFS curves for 0-gap to 9-gap.
The maximum Acc and the corresponding number of features at different g values.
| g | Number of features | Maximum Acc (%) |
|---|---|---|
| 0 | 107 | 83.06 |
| 1 | 213 | 84.69 |
| 2 | 135 | 83.39 |
| 3 | 87 | 81.76 |
| 4 | 42 | 80.78 |
| 5 | 89 | 84.04 |
| 6 | 70 | 82.41 |
| 7 | 174 | 82.73 |
| 8 | 255 | 82.41 |
| 9 | 94 | 83.06 |
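The feature counts in this table can be checked against the 1266 features mentioned in the Figure 3 caption. The short sketch below copies the ten rows, sums the per-gap feature counts, and finds the gap with the highest accuracy. Reading the 1266 features as the combination of the ten per-gap optimal subsets is an inference from the matching totals, not something this excerpt states; the variable names are hypothetical.

```python
# g -> (number of features at the IFS peak, maximum Acc in %), copied from the table
ifs_peaks = {
    0: (107, 83.06),
    1: (213, 84.69),
    2: (135, 83.39),
    3: (87, 81.76),
    4: (42, 80.78),
    5: (89, 84.04),
    6: (70, 82.41),
    7: (174, 82.73),
    8: (255, 82.41),
    9: (94, 83.06),
}

# Total number of features if the ten per-gap subsets are concatenated.
total_features = sum(n_features for n_features, _ in ifs_peaks.values())

# Gap whose best subset reaches the highest single-gap accuracy.
best_g = max(ifs_peaks, key=lambda g: ifs_peaks[g][1])

print(total_features)             # 1266
print(best_g, ifs_peaks[best_g])  # 1 (213, 84.69)
```

No single gap exceeds 84.69%, whereas the abstract reports 87.95% for the final method, which is the apparent motivation for combining the subsets.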
Figure 3. A heat map for 1266 features based on different Z-scores.
Figure 4. A plot showing the IFS curve obtained by using mRMR.
Figure 5. The ROC curve for the prediction of phage virion proteins by using 368 optimal features. An auROC of 0.915 was obtained in a five-fold cross-validation test. The diagonal dotted line denotes random guessing, with an auROC of 0.5.
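To make the auROC values concrete, the sketch below computes the area under the ROC curve as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. This is a standard equivalent formulation, not necessarily the routine the authors used; the function name `auroc` and the toy labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve from the pairwise ranking of positives vs negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = phage virion protein, 0 = non-virion protein.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_auc = auroc(toy_labels, toy_scores)          # 3 of 4 pairs ordered correctly
chance_auc = auroc(toy_labels, [0.5, 0.5, 0.5, 0.5])  # uninformative scores
```

A classifier that assigns the same score to every sample lands on the diagonal of the ROC plot, which is why 0.5 serves as the random-guess baseline.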
Comparing the proposed method with other published methods.
| Ref. | Sn (%) | Sp (%) | Acc (%) | MCC | auROC |
|---|---|---|---|---|---|
| [ | 75.76 | 80.77 | 79.15 | - | 0.855 |
| [ | 75.76 | 89.42 | 85.02 | - | 0.899 |
| [ | 73.70 | 93.30 | 87.00 | 0.695 | 0.900 |
| [ | 96.97 | 98.56 | 98.05 | 0.963 | 0.990 |
| This work | 83.83 | 89.90 | 87.95 | 0.761 | 0.915 |
Comparing the proposed method with other published methods on the independent dataset.
| Ref. | Sn (%) | Sp (%) | Acc (%) | MCC | auROC |
|---|---|---|---|---|---|
| [ | 60.00 | 76.50 | 71.30 | 0.357 | 0.742 |
| [ | 66.70 | 85.90 | 79.80 | 0.531 | 0.844 |
| This work | 70.00 | 78.13 | 75.53 | 0.464 | 0.651 |
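The four numeric columns before auROC in this table are consistent with sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC). The sketch below computes these from a 2x2 confusion matrix. The counts are an assumption: the excerpt does not give the size of the independent dataset, and 30 positives with 64 negatives were chosen only because they agree with the "This work" row to within rounding.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return Sn, Sp, Acc (all in percent) and MCC for a 2x2 confusion matrix."""
    sn = 100.0 * tp / (tp + fn)                    # sensitivity: recall on positives
    sp = 100.0 * tn / (tn + fp)                    # specificity: recall on negatives
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc


# Hypothetical counts (not stated in the source): 21 of 30 positives and
# 50 of 64 negatives classified correctly.
sn, sp, acc, mcc = binary_metrics(tp=21, fn=9, tn=50, fp=14)
print(sn, sp, acc, mcc)
```

Because MCC uses all four cells of the confusion matrix, it is less sensitive to class imbalance than Acc, which is a common reason for reporting both.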