Chia-Ru Chung, Ya-Ping Chang, Yu-Lin Hsu, Siyu Chen, Li-Ching Wu, Jorng-Tzong Horng, Tzong-Yi Lee.
Abstract
Protein malonylation, a reversible post-translational modification of lysine residues, is associated with various biological functions, such as cellular regulation and pathogenesis. Understanding the mechanisms of malonylation at the molecular level requires an efficient method for identifying malonylation sites. However, experimental identification of malonylated substrates via mass spectrometry is time-consuming, labor-intensive, and expensive. Although numerous methods have been developed to predict malonylation sites in mammalian proteins, computational resources for identifying plant malonylation sites are very limited. In this study, a hybrid model incorporating multiple convolutional neural networks (CNNs) with physicochemical properties, evolutionary information, and sequence-based features was developed for identifying protein malonylation sites in mammals. For plant malonylation, multiple CNNs and random forests were integrated into a secondary modeling phase using a support vector machine. Independent testing demonstrated that the mammalian and plant malonylation models achieve areas under the receiver operating characteristic curve (AUC) of 0.943 and 0.772, respectively. The proposed scheme has been implemented as a web-based tool, Kmalo (https://fdblab.csie.ncu.edu.tw/kmalo/home.html), which can help facilitate the functional investigation of protein malonylation in mammals and plants.
Year: 2020 PMID: 32601280 PMCID: PMC7324624 DOI: 10.1038/s41598-020-67384-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. The amino acid compositions (AACs) of malonylation and non-malonylation sites in mammals (upper) and plants (bottom).
Figure 2. Two-sample logos of malonylation sites in (a) H. sapiens, (b) M. musculus, and (c) T. aestivum.
Figure 3. Two-sample logos of malonylation sites comparing (a) human to mouse, (b) human to wheat, and (c) mouse to wheat.
Figure 4. The PCC values of the features in mammalian (upper) and plant (bottom) proteins.
Figure 5. The functional distributions of malonylated proteins for (a) H. sapiens, (b) M. musculus, and (c) T. aestivum, including GO terms for biological processes, molecular functions, and cellular components.
Performance of the proposed hybrid model on mammalian proteins in tenfold cross-validation and independent testing. Because the hybrid model that included the RF models did not perform well, the final hybrid model does not incorporate them.
| Feature | Dimension | Window size | Model | ACC | SEN | SPE | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| One hot encoding | 30 × 21 | 31 | CNN | 0.713 | 0.712 | 0.714 | 0.233 | 0.784 |
| AAINDEX | 32 × 46 | 33 | CNN | 0.730 | 0.544 | 0.743 | 0.147 | 0.741 |
| PSSM | 33 × 20 | 33 | CNN | 0.707 | 0.707 | 0.707 | 0.216 | 0.775 |
| AAC | 21 | 23 | RF | 0.620 | 0.598 | 0.623 | 0.143 | 0.654 |
| PAAC | 34 | 17 | RF | 0.624 | 0.612 | 0.628 | 0.210 | 0.671 |
| Ensemble with NN (without RF models), tenfold cross-validation | | | | 0.764 | 0.653 | 0.661 | 0.174 | 0.742 |
| Ensemble with NN (without RF models), independent testing | | | | 0.866 | 0.910 | 0.864 | 0.480 | 0.943 |
PSSM: position-specific scoring matrix; AAC: amino acid composition; PAAC: pseudo-amino acid composition; CNN: convolutional neural network; RF: random forest; NN: neural network; ACC: accuracy; SEN: sensitivity; SPE: specificity; MCC: Matthews correlation coefficient; AUC: area under the receiver operating characteristic curve.
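The sequence-derived features in the table above (one-hot encoding and AAC) are computed from a fixed-size window of residues centred on a candidate lysine. The following is a minimal sketch of that idea, not the authors' implementation. The padding symbol `X`, the function names, and the toy sequence are illustrative assumptions. The table lists a 30 × 21 one-hot matrix for a window of 31, which may indicate that the central lysine is dropped; the sketch keeps it for simplicity.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "X"                             # assumed padding symbol for positions past the termini
ALPHABET = AMINO_ACIDS + PAD          # 21 symbols, matching the 21 columns in the table


def extract_window(sequence, site_index, window_size=31):
    """Return the residues centred on sequence[site_index], padded with PAD."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd")
    if sequence[site_index] != "K":
        raise ValueError("site_index must point at a lysine (K)")
    half = window_size // 2
    residues = []
    for pos in range(site_index - half, site_index + half + 1):
        residues.append(sequence[pos] if 0 <= pos < len(sequence) else PAD)
    return "".join(residues)


def one_hot(window):
    """Encode each residue as a 21-dimensional indicator vector."""
    return [[1 if symbol == residue else 0 for symbol in ALPHABET]
            for residue in window]


def amino_acid_composition(window):
    """Fraction of each standard residue among the non-padding positions."""
    counts = Counter(r for r in window if r in AMINO_ACIDS)
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


protein = "MSTKAVLGKRE"  # hypothetical toy sequence, not from the study
window = extract_window(protein, site_index=3, window_size=7)
print(window)
print(len(one_hot(window)), "x", len(one_hot(window)[0]))
print(amino_acid_composition(window))
```

A window near the end of a protein is padded rather than truncated, so every site yields a matrix of the same shape, which is what a CNN input layer requires.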
Performance of the proposed hybrid model on plant proteins in tenfold cross-validation and independent testing.
| Feature | Dimension | Window size | Model | ACC | SEN | SPE | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| One hot encoding | 32 × 21 | 33 | CNN | 0.598 | 0.572 | 0.600 | 0.095 | 0.635 |
| AAINDEX | 26 × 30 | 27 | CNN | 0.637 | 0.577 | 0.642 | 0.121 | 0.673 |
| PSSM | 31 × 20 | 31 | CNN | 0.614 | 0.571 | 0.617 | 0.103 | 0.647 |
| AAC | 20 | 39 | RF | 0.654 | 0.632 | 0.656 | 0.161 | 0.720 |
| PAAC | 34 | 39 | RF | 0.661 | 0.633 | 0.663 | 0.166 | 0.718 |
| Ensemble with SVM, tenfold cross-validation | | | | 0.660 | 0.653 | 0.661 | 0.174 | 0.742 |
| Ensemble with SVM, independent testing | | | | 0.691 | 0.682 | 0.692 | 0.195 | 0.772 |
PSSM: position-specific scoring matrix; AAC: amino acid composition; PAAC: pseudo-amino acid composition; CNN: convolutional neural network; RF: random forest; SVM: support vector machine; ACC: accuracy; SEN: sensitivity; SPE: specificity; MCC: Matthews correlation coefficient; AUC: area under the receiver operating characteristic curve.
Comparison of our model with Kmal-sp and LEMP for predicting malonylation sites in mammalian proteins.
| Tool | Independent testing set | TP | FN | FP | TN | ACC | SEN | SPE | Running time |
|---|---|---|---|---|---|---|---|---|---|
| Kmal-sp | Kmal-sp set | 270 | 190 | 4,138 | 6,151 | 0.597 | 0.587 | 0.598 | Around 2 days |
| Kmalo (proposed) | Kmal-sp set | 303 | 157 | 3,349 | 6,940 | 0.674 | 0.659 | 0.675 | Within minutes |
| LEMP | LEMP set | 1,130 | 121 | 2,672 | 16,389 | 0.862 | 0.903 | 0.860 | Within minutes |
| Kmalo (proposed) | LEMP set | 1,138 | 113 | 2,600 | 16,461 | 0.866 | 0.910 | 0.864 | Within minutes |
TP: true positive; FP: false positive; FN: false negative; TN: true negative; ACC: accuracy; SEN: sensitivity; SPE: specificity.
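The ACC, SEN, and SPE columns follow directly from the four confusion-matrix counts, so the reported values can be recomputed as a consistency check. The sketch below uses the standard definitions of these metrics and of the MCC. The function name and the dictionary labels are illustrative choices, and the counts are copied from the comparison table.

```python
import math


def confusion_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total   # accuracy
    sen = tp / (tp + fn)      # sensitivity (true positive rate)
    spe = tn / (tn + fp)      # specificity (true negative rate)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SEN": sen, "SPE": spe, "MCC": mcc}


# (TP, FN, FP, TN) per tool and independent testing set
counts = {
    "Kmal-sp": (270, 190, 4138, 6151),
    "Kmalo on Kmal-sp set": (303, 157, 3349, 6940),
    "LEMP": (1130, 121, 2672, 16389),
    "Kmalo on LEMP set": (1138, 113, 2600, 16461),
}

results = {tool: confusion_metrics(*c) for tool, c in counts.items()}
for tool, metrics in results.items():
    print(tool, {name: round(value, 3) for name, value in metrics.items()})
```

Recomputed this way, the ACC, SEN, and SPE values agree with the table to three decimals, and the MCC for Kmalo on the LEMP set comes out at about 0.48, in line with the independent-testing MCC reported for the mammalian hybrid model.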
Figure 6. ROC curves from independent testing, comparing Kmalo with Kmal-sp (left) and LEMP (right).
The number of malonylation and non-malonylation sites used in this study.
| Species | Training set (positive) | Training set (negative) | Testing set (positive) | Testing set (negative) | Independent testing set (positive) | Independent testing set (negative) |
|---|---|---|---|---|---|---|
| Mammalian | 5,006 | 76,264 | 1,252 | 19,066 | 460+ / 1,251# | 10,289+ / 19,061# |
| Plant | 196 | 2,394 | 82 | 1,195 | 82 | 1,195 |
“+” marks the independent testing set used for comparison with the Kmal-sp tool; “#” marks the independent testing set used for comparison with the LEMP tool.
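The site counts above are heavily skewed toward negatives, which is why sensitivity, specificity, and MCC are reported alongside accuracy. The short sketch below quantifies that skew. It assumes that the first data row of the table is the mammalian set and the second the plant set, and the helper name is illustrative.

```python
# (positive, negative) site counts copied from the table above.
# Assumption: first data row is mammalian, second is plant.
site_counts = {
    "mammalian training": (5006, 76264),
    "mammalian testing": (1252, 19066),
    "plant training": (196, 2394),
    "plant testing": (82, 1195),
}


def imbalance_summary(positive, negative):
    """Negatives per positive, and the accuracy of always predicting 'negative'."""
    total = positive + negative
    return {
        "neg_per_pos": negative / positive,
        "all_negative_accuracy": negative / total,
    }


summaries = {name: imbalance_summary(p, n) for name, (p, n) in site_counts.items()}
for name, s in summaries.items():
    print(f"{name}: {s['neg_per_pos']:.1f} negatives per positive, "
          f"all-negative baseline accuracy {s['all_negative_accuracy']:.3f}")
```

With roughly 15 negatives per positive in the mammalian training set, a classifier that labels every site as non-malonylated would already exceed 0.93 accuracy, so accuracy alone says little about how well malonylation sites are recovered.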