| Literature DB >> 36046247 |
Yu Chen, Sai Li, Jifeng Guo.
Abstract
Moonlighting proteins have at least two independent functions and are widely found in animals, plants, and microorganisms. They play important roles in signal transduction, cell growth and movement, tumor inhibition, DNA synthesis and repair, and the metabolism of biological macromolecules. Because moonlighting proteins are difficult to discover through biological experiments, many researchers identify them with bioinformatics methods, but the accuracies of these methods remain relatively low. We therefore propose a new method. In this study, we select SVMProt-188D as the feature input, combine linear discriminant analysis with basic machine-learning classifiers to study moonlighting proteins, and apply a bagging ensemble to the best-performing classifier, the support vector machine, so that moonlighting proteins are identified accurately and efficiently. The model achieves an accuracy of 93.26% and an F-score of 0.946 on the MPFit dataset, outperforming the existing MEL-MP model, and it also achieves good results on two other moonlighting protein datasets.
Keywords: bagging-SVM; linear discriminant analysis; machine learning; moonlighting proteins; protein recognition
Year: 2022 PMID: 36046247 PMCID: PMC9420859 DOI: 10.3389/fgene.2022.963349
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1 The pipeline of our experiment: (A) benchmark dataset acquisition; (B) feature extraction; (C) model construction; (D) model evaluation.
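As a minimal sketch of the pipeline above, assuming scikit-learn: random placeholder vectors stand in for the real SVMProt-188D features (only the class sizes, 268 MPs and 162 non-MPs, follow the benchmark dataset), and the ensemble size of 10 is an illustrative assumption, not a value from the paper.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(430, 188))                    # placeholder for SVMProt-188D features
y = np.concatenate([np.ones(268), np.zeros(162)])  # 268 MPs, 162 non-MPs

# Feature extraction -> LDA dimensionality reduction -> bagged SVM, evaluated by 10-fold CV.
model = make_pipeline(
    LinearDiscriminantAnalysis(n_components=1),    # binary task: one discriminant axis
    BaggingClassifier(SVC(), n_estimators=10, random_state=0),
)
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(round(float(scores.mean()), 4))
```

On real features the reported accuracy is 93.26%; on this random stand-in data the number is meaningless and only demonstrates that the pieces fit together.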
Composition of the benchmark dataset.
| Organism | MPs (number) | MPs (%) | Non-MPs (number) | Non-MPs (%) |
|---|---|---|---|---|
| Human | 45 | 16.8 | 60 | 37.0 |
| | 30 | 11.19 | 16 | 9.88 |
| Yeast | 27 | 10.1 | 34 | 20.9 |
| Mouse | 11 | 4.1 | 52 | 32.1 |
| Other | 155 | 57.81 | 0 | 0.0 |
| Total | 268 | 100 | 162 | 100 |
Eight physicochemical properties underlying the 188-dimensional feature.
| Attribute | Division | | |
|---|---|---|---|
| Hydrophobicity | Polar: RKEDQN | Neutral: GASTPHY | Hydrophobic: CVLIMFW |
| Normalized van der Waals volume | Small: GASCTPD | Medium: NVEQIL | Large: MHKFRYW |
| Polarity | Low: LIFWCMVY | Medium: PATGS | High: HQRKNED |
| Polarizability | Low: GASDT | Medium: GPNVEQIL | High: KMHFRYW |
| Charge | Positive: KR | Neutral: ANCQGHILMFPSTWYV | Negative: DE |
| Secondary structure | Helix: EALMQKRH | Strand: VIYCWFT | Coil: GNPSD |
| Solvent accessibility | Buried: ALFCGIVW | Exposed: RKQEND | Intermediate: MPSTHY |
| Surface tension | Large: GQDNAHR | Medium: KTSEC | Small: ILMFPWYV |
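The table above defines the residue groups behind the composition part of the 188-D descriptor. A hedged sketch of how the three composition values for one property (hydrophobicity) could be computed; the function name is ours for illustration, not an SVMProt API.

```python
# Hydrophobicity residue groups, taken from the table above.
GROUPS = {
    "polar": set("RKEDQN"),
    "neutral": set("GASTPHY"),
    "hydrophobic": set("CVLIMFW"),
}

def hydrophobicity_composition(seq: str) -> list[float]:
    """Fraction of residues in each hydrophobicity group (3 of the 188 dims)."""
    n = len(seq)
    return [sum(aa in group for aa in seq) / n for group in GROUPS.values()]

print(hydrophobicity_composition("MKVLW"))  # [0.2, 0.0, 0.8]
```

The full descriptor repeats this composition idea, plus transition and distribution statistics, across all eight properties together with amino-acid composition to reach 188 dimensions.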
FIGURE 2 The diagram of LDA applied to a binary classification task.
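A minimal illustration of the LDA step, assuming scikit-learn: for a binary task, LDA projects the feature vectors onto the single direction that best separates the two classes. Synthetic Gaussian data stand in for the real features.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
# Two synthetic 5-D Gaussian classes with shifted means.
X = np.vstack([rng.normal(0.0, 1.0, (50, 5)), rng.normal(2.0, 1.0, (50, 5))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)  # binary case: at most one component
z = lda.fit_transform(X, y)
print(z.shape)  # (100, 1)
```

Downstream classifiers then operate on this one-dimensional projection instead of the original feature space.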
The results of 10-fold cross-validation using a variety of classifiers and hybrid features.
| Feature | Method | ACC (%) | Precision | Recall | F-score | AUC |
|---|---|---|---|---|---|---|
| Pse-AAC | KNN | 87.4419 | 0.885 | 0.923 | 0.901 | 0.863 |
| | DT | 87.2093 | 0.892 | 0.909 | 0.898 | 0.865 |
| | MLP | 88.8372 | 0.898 | 0.931 | 0.912 | 0.878 |
| | RF | 85.3488 | 0.883 | 0.887 | 0.883 | 0.847 |
| | XGB | 86.0465 | 0.891 | 0.891 | 0.888 | 0.856 |
| | SVM | 87.9070 | 0.900 | 0.913 | 0.904 | 0.872 |
| SVMProt-188D | KNN | 91.3953 | 0.919 | 0.944 | 0.931 | 0.906 |
| | DT | 91.1628 | 0.918 | 0.946 | 0.929 | 0.906 |
| | MLP | 92.5581 | 0.939 | 0.941 | 0.939 | 0.922 |
| | RF | 89.3023 | 0.917 | 0.911 | 0.912 | 0.891 |
| | XGB | 89.5349 | 0.920 | 0.911 | 0.914 | 0.893 |
| | SVM | 92.7907 | 0.943 | 0.942 | 0.942 | 0.925 |
| Pse-PSSM | KNN | 85.8514 | 0.886 | 0.886 | 0.884 | 0.848 |
| | DT | 84.4189 | 0.884 | 0.868 | 0.872 | 0.839 |
| | MLP | 86.5116 | 0.917 | 0.868 | 0.888 | 0.869 |
| | RF | 82.5581 | 0.858 | 0.862 | 0.858 | 0.815 |
| | XGB | 84.1860 | 0.869 | 0.883 | 0.873 | 0.833 |
| | SVM | 87.6744 | 0.921 | 0.883 | 0.898 | 0.879 |
FIGURE 3 The accuracy of different features in each classifier.
FIGURE 4 ROC curves of different classifiers on SVMProt-188D.
FIGURE 5 The performance of the model before and after applying LDA.
The results of Bagging-SVM and the single SVM.
| Method | ACC (%) | Precision | Recall | F-score | AUC |
|---|---|---|---|---|---|
| SVM | 92.7907 | 0.943 | 0.942 | 0.942 | 0.925 |
| Bagging-SVM | 93.2558 | 0.944 | 0.949 | 0.946 | 0.928 |
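A hedged sketch of the Bagging-SVM versus single-SVM comparison above, assuming scikit-learn's `BaggingClassifier`; the synthetic data, RBF kernel, and ensemble size of 10 are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)  # noisy linear labels

# Bagging trains each SVM on a bootstrap resample and votes over the ensemble.
results = {}
for name, clf in [
    ("SVM", SVC(kernel="rbf")),
    ("Bagging-SVM", BaggingClassifier(SVC(kernel="rbf"), n_estimators=10, random_state=0)),
]:
    results[name] = float(cross_val_score(clf, X, y, cv=10).mean())
print(results)
```

Averaging over bootstrap-trained SVMs reduces the variance of a single SVM, which is the effect behind the modest accuracy gain reported in the table (92.79% to 93.26%).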
Comparison with other methods.
| Method | ACC (%) | Precision | Recall | F-score | AUC |
|---|---|---|---|---|---|
| MPFit | 75 | * | * | 0.784 | * |
| MEL-MP | * | 0.895 | 0.893 | 0.892 | 0.947 |
| Shirafkan’s | 81.7 | 0.813 | * | 0.802 | 0.806 |
| Ours | 92.7907 | 0.943 | 0.942 | 0.942 | 0.925 |
The results of other datasets on our model.
| Method | ACC (%) | Precision | Recall | F-score | AUC |
|---|---|---|---|---|---|
| Method1 | 91.1746 | 0.91 | 0.949 | 0.929 | 0.901 |
| Method2 | 91.4530 | 0.907 | 0.958 | 0.932 | 0.902 |
FIGURE 6 The performance of the plant MPs dataset on our model.