Liwen Wu, Song Gao, Shaowen Yao, Feng Wu, Jie Li, Yunyun Dong, Yunqi Zhang.
Abstract
Identifying the subcellular localization of a given protein is an essential part of biological and medical research, since a protein must be localized in the correct organelle to carry out its physiological function. Conventional biological experiments for protein subcellular localization have limitations such as high cost and low efficiency, so many computational methods have been proposed to address these problems. However, some of these methods still need improvement when protein subcellular localization involves class imbalance. We propose a new model, generating minority samples for protein subcellular localization (Gm-PLoc), to predict the subcellular localization of multi-label proteins. The model has three steps: using the position-specific scoring matrix to extract distinguishable features of proteins; synthesizing samples of the minority categories with a revised generative adversarial network to balance the category distribution; and training a classifier on the rebalanced dataset to predict the subcellular localization of multi-label proteins. One benchmark dataset is used to evaluate the performance of the presented model, and the experimental results demonstrate that Gm-PLoc performs well for multi-label protein subcellular localization.
Keywords: class imbalance learning; deep learning; generative adversarial networks; multi-label classification; protein subcellular localization
Year: 2022 PMID: 35783287 PMCID: PMC9240597 DOI: 10.3389/fgene.2022.912614
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1. Predicting subcellular localization of a given protein by Gm-PLoc.
FIGURE 2. An illustration of SM-GAN.
FIGURE 3. The model architecture of ML-DeepFM.
FIGURE 4. The sample proportion of the benchmark dataset.
FIGURE 5. The relationship between SM-GAN and ML-DeepFM.
FIGURE 6. Experimental results for the exhaustive combinations of SM-GAN and ML-DeepFM.
TABLE 1. Comparing SM-GAN with other oversampling methods.
| Method for imbalanced learning | Hamming loss (Hl) | Coverage (Co) | One-error (Oe) | Average precision (Ap) | Ranking loss (Rl) |
|---|---|---|---|---|---|
| SMOTE | 0.106 | 2.050 | 0.552 | 0.626 | 0.121 |
| Borderline-SMOTE | 0.102 | 2.138 | 0.521 | 0.646 | 0.126 |
| SVM-Balance | 0.105 | 1.998 | 0.546 | 0.635 | 0.117 |
| SinGAN | 0.089 | 1.983 | 0.463 | 0.684 | 0.116 |
| DCGAN | 0.083 | 1.669 | 0.452 | 0.696 | 0.102 |
| SM-GAN |  |  |  |  |  |
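As a point of reference for the oversampling baselines in the table above, the following is a minimal sketch of SMOTE-style interpolation, the idea behind the first row. It is not SM-GAN and not the authors' code: the function name `smote_like`, its parameters, and the toy feature vectors are assumptions made for illustration. It only synthesizes feature vectors; how labels are assigned to synthetic multi-label samples is left out.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority-class feature vectors (illustrative values only).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like(minority, n_new=5)
```

Each synthetic point lies on a line segment between two existing minority samples, which is the main contrast with GAN-based methods that learn to generate samples from the minority distribution.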
TABLE 2. Comparing Gm-PLoc with other models of protein subcellular localization.
| Model of subcellular localization | Coverage (Co) | Average precision (Ap) | Ranking loss (Rl) |
|---|---|---|---|
| GO + AAC + PseAAC + IMMMLGP | 4.303 | 0.581 | 0.419 |
| GO + FunD + PSSM + OET-KNN | 5.317 | 0.579 | 0.496 |
| PSSM + PseAAC + Multi-SVM | 1.719 | 0.706 | 0.108 |
| PSSM + SM-GAN + ML-DeepFM |  |  |  |
Bold values in Table 2 indicate the best result for each metric among the protein subcellular localization methods compared.
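The column abbreviations in the two tables (Hl, Co, Oe, Ap, Rl) are conventionally read as Hamming loss, coverage, one-error, average precision, and ranking loss. The sketch below computes them from their standard multi-label definitions in plain Python. It is an illustration under assumptions, not the evaluation code used for the tables: the function names are hypothetical, scores are assumed to have no ties, every sample is assumed to have at least one relevant and one irrelevant label, and the exact variants used in the source (for example, whether coverage subtracts 1) are not stated here.

```python
def ranks(scores):
    """Rank labels by descending score; the highest score gets rank 1."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    rank = [0] * len(scores)
    for position, label in enumerate(order, start=1):
        rank[label] = position
    return rank


def hamming_loss(truth, pred):
    """Fraction of (sample, label) cells where the binary prediction is wrong."""
    n, q = len(truth), len(truth[0])
    wrong = sum(t != p for tr, pr in zip(truth, pred) for t, p in zip(tr, pr))
    return wrong / (n * q)


def one_error(truth, scores):
    """Fraction of samples whose top-scored label is not a true label."""
    misses = 0
    for tr, sc in zip(truth, scores):
        top = max(range(len(sc)), key=lambda j: sc[j])
        misses += tr[top] == 0
    return misses / len(truth)


def coverage(truth, scores):
    """Average depth in the ranking needed to cover all true labels, minus 1."""
    total = 0
    for tr, sc in zip(truth, scores):
        r = ranks(sc)
        total += max(r[j] for j, t in enumerate(tr) if t) - 1
    return total / len(truth)


def ranking_loss(truth, scores):
    """Average fraction of (true, false) label pairs that are ordered wrongly."""
    total = 0.0
    for tr, sc in zip(truth, scores):
        rel = [j for j, t in enumerate(tr) if t]
        irr = [j for j, t in enumerate(tr) if not t]
        bad = sum(sc[a] <= sc[b] for a in rel for b in irr)
        total += bad / (len(rel) * len(irr))
    return total / len(truth)


def average_precision(truth, scores):
    """Average, over true labels, of the precision at each true label's rank."""
    total = 0.0
    for tr, sc in zip(truth, scores):
        r = ranks(sc)
        rel = [j for j, t in enumerate(tr) if t]
        total += sum(sum(r[k] <= r[j] for k in rel) / r[j] for j in rel) / len(rel)
    return total / len(truth)
```

Under these definitions, lower is better for Hamming loss, coverage, one-error, and ranking loss, while higher is better for average precision.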