Ying Li, Jianing Zhao, Zhaoqian Liu, Cankun Wang, Lizheng Wei, Siyu Han, Wei Du.
Abstract
Moonlighting proteins (MPs) are a special type of protein with multiple independent functions. MPs play vital roles in cellular regulation, diseases, and biological pathways. At present, very few MPs have been discovered by biological experiments. Owing to the scarcity of data samples, computation-based methods for identifying MPs are limited, and there is currently no de novo prediction method for MPs. Therefore, systematic research on and identification of MPs are urgently required. In this paper, we propose a multimodal deep ensemble learning architecture, named MEL-MP, which is the first de novo computational model for predicting MPs. First, we extract four sequence-based features: primary protein sequence information, evolutionary information, physical and chemical properties, and secondary protein structure information. Second, we select a specific classifier for each kind of feature. Finally, we apply a stacked ensemble to integrate the outputs of the classifiers. Comprehensive model selection and cross-validation experiments show that specific classifiers for specific feature types achieve superior performance. To validate the effectiveness of the fusion-based stacked ensemble, different feature fusion strategies, including direct combination and a multimodal deep auto-encoder, are used for comparison. MEL-MP exhibits superior prediction performance (F-score = 0.891), surpassing the existing machine learning model, MPFit (F-score = 0.784). In addition, MEL-MP is used to predict potential MPs among all human proteins. Furthermore, the distribution of predicted MPs on different chromosomes, the evolution of MPs, the association of MPs with diseases, and the functional enrichment of MPs are explored. Finally, a user-friendly web server is available at: http://ml.csbg-jlu.site/mel-mp/.
Keywords: deep learning; ensemble learning; multimodal; prediction model; protein moonlighting
Year: 2021 PMID: 33828582 PMCID: PMC8019903 DOI: 10.3389/fgene.2021.630379
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. The MEL-MP workflow. The four sub-models correspond to the four types of extracted features. Seq-BiLSTM denotes that a bidirectional long short-term memory neural network (BiLSTM) is selected for the primary sequence features. PSSM-RF denotes that a random forest (RF) classifier is used for the PSSM evolutionary information. SS-ARF denotes that a classifier based on an auto-encoder and RF is applied to the secondary structure features. AA-MLP denotes that a multilayer perceptron (MLP) is applied to the physical and chemical features. Finally, logistic regression (LR) integrates the outputs of all models.
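The stacking idea in this workflow (one base model per feature type, with a logistic regression combining their outputs) can be sketched in a few lines. This is a minimal, dependency-free illustration and not the authors' implementation: the base learners are nearest-centroid scorers standing in for the BiLSTM, RF, auto-encoder + RF, and MLP sub-models, the modality keys (`seq`, `pssm`, `ss`, `aa`), the helper names, and the Gaussian toy data are all hypothetical, and the 5-fold out-of-fold scheme is an assumed, common way to train the meta-learner.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def fit_centroids(X, y):
    """Stand-in base learner: the mean feature vector of each class."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def centroid_proba(cents, x):
    """Score in (0, 1); higher when x is closer to the positive centroid."""
    return sigmoid(math.dist(x, cents[0]) - math.dist(x, cents[1]))

def out_of_fold(X, y, k=5):
    """Base-model scores for each sample from a model that never saw it."""
    oof = [0.0] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        cents = fit_centroids([X[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(X), k):
            oof[i] = centroid_proba(cents, X[i])
    return oof

def fit_logreg(Z, y, lr=0.5, epochs=500):
    """Meta-learner: logistic regression fitted by batch gradient descent."""
    w, b, n = [0.0] * len(Z[0]), 0.0, len(Z)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, t in zip(Z, y):
            err = sigmoid(b + sum(wi * zi for wi, zi in zip(w, z))) - t
            grad_b += err
            for j, zj in enumerate(z):
                grad_w[j] += err * zj
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b

def fit_stack(modalities, y, k=5):
    """Train one base model per modality, then LR on out-of-fold scores."""
    names = sorted(modalities)
    oof_cols = [out_of_fold(modalities[n], y, k) for n in names]
    Z = [list(row) for row in zip(*oof_cols)]
    meta = fit_logreg(Z, y)
    bases = {n: fit_centroids(modalities[n], y) for n in names}
    return names, bases, meta

def predict_stack(model, sample):
    """Probability that one protein (dict: modality -> vector) is an MP."""
    names, bases, (w, b) = model
    z = [centroid_proba(bases[n], sample[n]) for n in names]
    return sigmoid(b + sum(wi * zi for wi, zi in zip(w, z)))

def make_toy_data(n=100, seed=0):
    """Hypothetical Gaussian toy features; not real protein data."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    dims = {"seq": 4, "pssm": 3, "ss": 3, "aa": 2}
    mods = {name: [[rng.gauss(1.5 * t, 1.0) for _ in range(d)] for t in y]
            for name, d in dims.items()}
    return mods, y

mods, y = make_toy_data()
model = fit_stack(mods, y)
print(predict_stack(model, {name: mods[name][1] for name in mods}))
```

Training the meta-learner on out-of-fold scores rather than in-sample scores keeps it from simply trusting whichever base model overfits most; whether MEL-MP uses exactly this scheme is not stated here.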
Performance of all features tested with diverse methods and ensemble learning.
| Sequence | k-mer + LR | 0.792 | 0.743 | 0.746 |
| | k-mer + RF | 0.779 | 0.773 | 0.758 |
| | k-mer + SVM | 0.767 | 0.748 | 0.740 |
| | Acf(k-mer) + LR | 0.766 | 0.755 | 0.755 |
| | Acf(k-mer) + RF | 0.819 | 0.813 | 0.812 |
| | Acf(k-mer) + SVM | 0.809 | 0.806 | 0.803 |
| | Acf(k-mer) + MLP | 0.827 | 0.835 | 0.827 |
| | Acf(k-mer) + 1DCNN | 0.822 | 0.818 | 0.814 |
| PSSM | LR | 0.829 | 0.818 | 0.819 |
| | SVM | 0.866 | 0.862 | 0.862 |
| | BiLSTM | 0.830 | 0.827 | 0.824 |
| | MLP | 0.866 | 0.862 | 0.862 |
| | 2DCNN | 0.865 | 0.848 | 0.850 |
| Physical and chemical property | LR | 0.796 | 0.792 | 0.789 |
| | RF | 0.843 | 0.839 | 0.837 |
| | SVM | 0.843 | 0.841 | 0.840 |
| | 1DCNN | 0.833 | 0.827 | 0.824 |
| | BiLSTM | 0.837 | 0.825 | 0.822 |
| Secondary structure | k-mer + LR | 0.769 | 0.752 | 0.755 |
| | k-mer + RF | 0.797 | 0.794 | 0.792 |
| | k-mer + SVM | 0.806 | 0.801 | 0.798 |
| | Autoencoder(k-mer) + LR | 0.749 | 0.738 | 0.738 |
| | Autoencoder(k-mer) + SVM | 0.797 | 0.792 | 0.787 |
The bold part represents the best classification model for each feature.
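Several rows above pair a k-mer representation with a classifier. As a rough illustration of what such a representation looks like, the sketch below turns a sequence into a fixed-length vector of normalized k-mer frequencies. The function name, the choice of `k=2`, and the handling of non-standard residues are assumptions made for this example; the values of k and the Acf and auto-encoder transforms used in the study are not specified in this section.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def kmer_frequencies(sequence, k=2, alphabet=AMINO_ACIDS):
    """Return normalized k-mer frequencies in a fixed vocabulary order.

    k-mers containing characters outside `alphabet` are ignored
    (an assumption made for this sketch).
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in vocab)
    return [counts[m] / total if total else 0.0 for m in vocab]

# 20**2 = 400 features for dipeptides
vec = kmer_frequencies("MKVLAAGIVGLL", k=2)
print(len(vec))
```

The same function could be applied to a secondary structure string by passing a smaller alphabet, for example `alphabet="HEC"` for a three-state helix/strand/coil encoding (again an assumed encoding).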
Comparison of other feature fusion methods with MEL-MP.
| LR | 0.836 | 0.822 | 0.823 |
| RF | 0.869 | 0.865 | 0.865 |
| SVM | 0.833 | 0.830 | 0.827 |
| Auto-encoder + LR | 0.664 | 0.650 | 0.651 |
| Auto-encoder + RF | 0.787 | 0.783 | 0.777 |
| Auto-encoder + SVM | 0.767 | 0.767 | 0.757 |
| Multimodal auto-encoder + LR | 0.841 | 0.832 | 0.831 |
| Multimodal auto-encoder + RF | 0.883 | 0.881 | 0.879 |
| Multimodal auto-encoder + SVM | 0.860 | 0.857 | 0.854 |
Each feature fusion method is shown in bold, and the experimental results of MEL-MP are also shown in bold.
Figure 2. ROC curves of the four single models and the stacking method.
Figure 3. Distribution of potential moonlighting proteins (MPs) across human chromosomes. The x-axis corresponds to the chromosome pairs and the sex chromosomes (X and Y). The y-axis shows the ratio of predicted MPs on each chromosome pair.
Figure 4. Phylostrata for predicted moonlighting proteins (MPs) and other proteins.
Figure 5. The top 13 diseases enriched in moonlighting proteins (MPs).
Figure 6. Pathway diagram of predicted moonlighting proteins (MPs) and other proteins.
Figure 7. Gene family clustering of predicted moonlighting proteins (MPs) and other proteins.