Hussam Al-Barakati, Niraj Thapa, Saigo Hiroto, Kaushik Roy, Robert H Newman, Dukka Kc.
Abstract
Malonylation, which has recently emerged as an important lysine modification, regulates diverse biological activities and has been implicated in several pervasive disorders, including cardiovascular disease and cancer. However, conventional global proteomics analysis using tandem mass spectrometry can be time-consuming, expensive and technically challenging. Therefore, to complement and extend existing experimental methods for malonylation site identification, we developed two novel computational methods for malonylation site prediction, RF-MaloSite and DL-MaloSite, based on random forest and deep learning algorithms, respectively. DL-MaloSite requires the primary amino acid sequence as an input, whereas RF-MaloSite utilizes a diverse set of biochemical, physicochemical and sequence-based features. While systematic assessment of performance metrics suggests that both RF-MaloSite and DL-MaloSite perform well in all metrics tested, our methods perform particularly well in the areas of accuracy, sensitivity and overall method performance (assessed by the Matthews correlation coefficient, MCC). For instance, RF-MaloSite exhibited MCC scores of 0.42 and 0.40 using 10-fold cross-validation and an independent test set, respectively. Meanwhile, DL-MaloSite was characterized by MCC scores of 0.51 and 0.49 based on 10-fold cross-validation and an independent test set, respectively. Importantly, both methods exhibited efficiency scores that were on par with or better than those achieved by existing malonylation site prediction methods. The identification of these sites may also provide important insights into the mechanisms of crosstalk between malonylation and other lysine modifications, such as acetylation, glutarylation and succinylation. To facilitate their use, both methods have been made freely available to the research community at https://github.com/dukkakc/DL-MaloSite-and-RF-MaloSite.
Keywords: Convolutional neural network; Deep learning; Malonylation; Post-translational Modification Sites; Random forest
Year: 2020 PMID: 32322367 PMCID: PMC7160427 DOI: 10.1016/j.csbj.2020.02.012
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig 1. Schematic diagrams illustrating the architectures of RF-MaloSite (A) and DL-MaloSite (B).
The number of positive and negative sites in the training and independent sets before (left) and after (right) balancing:
| Original data | Positive sites (before/after) | Negative sites (before/after) |
|---|---|---|
| Training | 3978/3978 | 68,595/3978 |
| Testing | 988/988 | 16,097/988 |
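The table gives only the counts before and after balancing, not the procedure. The counts are consistent with reducing the negative set to the size of the positive set, so the following is a minimal sketch that assumes random undersampling of negatives. The function name `undersample`, the seed, and the toy site records are hypothetical and are not taken from the published code.

```python
import random

def undersample(positives, negatives, seed=0):
    """Randomly keep as many negatives as there are positives (assumed procedure)."""
    if len(negatives) < len(positives):
        raise ValueError("expected at least as many negatives as positives")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    kept_negatives = rng.sample(negatives, len(positives))  # sampling without replacement
    return positives, kept_negatives

# Toy stand-ins sized like the training row of the table above.
train_pos = [("pos", i) for i in range(3978)]
train_neg = [("neg", i) for i in range(68595)]

bal_pos, bal_neg = undersample(train_pos, train_neg)
print(len(bal_pos), len(bal_neg))
```

Sampling without replacement keeps every retained negative site unique, which matters if the balanced set is later split into cross-validation folds.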
Composition of the complete feature set used during initial model development. Detailed descriptions of each feature class except EAAC, AAP, and AAI are provided in [40].
| NO | Feature Class | Abbreviation | Feature Length |
|---|---|---|---|
| 1 | Binary feature | BIN | 560 |
| 2 | Amino Acid Composition | AAC | 20 |
| 3 | Composition of amino acid pairs | AACP | 400 |
| 4 | K-space amino acid pairs | KSAAP | 2400 |
| 5 | Composition, Transition and Distribution | CTD | 147 (C:21; T:21, D:105) |
| 6 | Conjoint triad | CT | 512 |
| 7 | Pseudo-amino acid composition | PseAAC | 47 |
| 8 | Autocorrelation | Au | 720 |
| 9 | Amino acid factor | AAF | 140 |
| 10 | Amino acid properties | AAP | 392* |
| 11 | Amino acid index | AAI | 812* |
| 12 | Enhanced amino acid composition | EAAC | 440* |
| 13 | Combined all features | Total | 6590 |
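To make the feature lengths concrete, the sketch below computes two of the simpler classes in the table: amino acid composition (AAC, length 20) and composition of amino acid pairs (AACP, length 400 = 20 × 20). It uses the standard textbook definitions (relative frequencies of residues and of adjacent residue pairs). The normalization used by the authors and the example peptide are assumptions for illustration only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(window):
    """Relative frequency of each of the 20 residues in the window."""
    residues = [r for r in window if r in AMINO_ACIDS]
    total = len(residues) or 1  # avoid division by zero on empty input
    return [residues.count(a) / total for a in AMINO_ACIDS]

def pair_composition(window):
    """Relative frequency of each of the 400 adjacent residue pairs."""
    pairs = [window[i:i + 2] for i in range(len(window) - 1)]
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    total = len(valid) or 1
    return [valid.count(a + b) / total for a, b in product(AMINO_ACIDS, repeat=2)]

# Hypothetical peptide window, not taken from the paper's dataset.
peptide = "MKTAYIAKQRQISFVKSHFSRQLEERLGL"
aac_vector = aac(peptide)
pair_vector = pair_composition(peptide)
print(len(aac_vector), len(pair_vector))
```

Concatenating such per-class vectors is how a combined feature vector of the size listed in the last row of the table would be assembled.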
Fig 2. The distributions of each kind of attribute for optimal features with the corresponding percentage score. The total number of selected attributes was 104.
Fig 3. Top ten most important features with corresponding weights.
Top ten features ranked by information gain score. PseAAC: Pseudo-amino acid composition; EAAC: Enhanced amino acid composition; CTD: Composition, transition and distribution.
| Rank | Feature | Description |
|---|---|---|
| 1 | APAAC8 | PseAAC feature at a particular site |
| 2 | ChargeD1100 | Distribution feature denoting positive charge of class 1 at the last position |
| 3 | SW.12.K | EAAC feature denoting enrichment of lysine at portion 12 |
| 4 | PolarityC1 | Composition feature of class 1 for the polarity property |
| 5 | ChargeD1001 | Distribution feature denoting positive charge of class 1 at the first position |
| 6 | APAAC1 | PseAAC feature at a particular site |
| 7 | ChargeD1025 | Distribution feature denoting positive charge of class 1 at the second position |
| 8 | APAAC20 | PseAAC feature at a particular site |
| 9 | SolventAccessibilityC1 | Composition feature of class 1 for solvent accessibility |
| 10 | SW.15.K | EAAC feature denoting enrichment of lysine at portion 15 |
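The ranking above is by information gain, i.e. the reduction in class-label entropy obtained by knowing a feature's value. The sketch below illustrates the idea for a single numeric feature split at one threshold. How the authors discretized their features is not stated here, so the thresholding and the toy values are assumptions.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(values, labels, threshold):
    """Entropy of the labels minus their entropy after splitting at threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    conditional = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - conditional

# Toy data: the first feature separates the classes, the second does not.
labels = [0, 0, 1, 1]
gain_informative = information_gain([0.1, 0.2, 0.8, 0.9], labels, threshold=0.5)
gain_uninformative = information_gain([0.1, 0.9, 0.1, 0.9], labels, threshold=0.5)
print(gain_informative, gain_uninformative)
```

Ranking all candidate features by such a score and keeping the highest-scoring ones is one common way to arrive at a reduced feature set.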
Comparison between various machine learning algorithms based on 10-fold cross-validation using 104 features. ACC: Accuracy; SN: Sensitivity; SP: Specificity; MCC: Matthews correlation coefficient.
| Classifier | ACC (%) | SN (%) | SP (%) | MCC |
|---|---|---|---|---|
| SVM | 64 | 71 | 58 | 0.30 |
| KNN | 61 | 67 | 56 | 0.23 |
| NN | 69 | 73 | 65 | 0.37 |
Comparison between our methods and various machine learning algorithms based on an independent test set using 104 features. ACC: Accuracy; SN: Sensitivity; SP: Specificity; MCC: Matthews correlation coefficient.
| Classifier | ACC (%) | SN (%) | SP (%) | MCC |
|---|---|---|---|---|
| SVM | 65 | 69 | 62 | 0.29 |
| KNN | 62 | 66 | 58 | 0.23 |
| NN | 68 | 76 | 58 | 0.35 |
Fig 4. Precision-recall (PR) curve based on the independent test set for DL-MaloSite (red) and RF-MaloSite (blue). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Comparison between SPRINT-Mal [18], iLMS [19], RF-MaloSite and DL-MaloSite based on 10-fold cross-validation using a M. musculus dataset. ACC: Accuracy; SN: Sensitivity; SP: Specificity; MCC: Matthews correlation coefficient.
| Method | Organism | ACC (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| SPRINT-Mal | 49 | 0.21 | 0.74 | |||
| iLMS | – | 49 | 80 | 0.26 | 0.74 | |
| RF-MaloSite | 68 | 65 | ||||
| DL-MaloSite | 70 | 68 | 0.73 |
Comparison between SPRINT-Mal [18], iLMS [19], RF-MaloSite and DL-MaloSite based on an independent test set from M. musculus. ACC: Accuracy; SN: Sensitivity; SP: Specificity; MCC: Matthews correlation coefficient.
| Method | Organism | ACC (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| SPRINT-Mal | 33 | 0.20 | ||||
| iLMS | – | – | – | 0.26 | 0.72 | |
| RF-MaloSite | 65 | 65 | 0.72 | |||
| DL-MaloSite | 68 | 51 | 0.72 |
Comparison between SPRINT-Mal [18], iLMS [19], RF-MaloSite and DL-MaloSite based on 10-fold cross-validation using a H. sapiens dataset. ACC: Accuracy; SN: Sensitivity; SP: Specificity; MCC: Matthews correlation coefficient.
| Method | Organism | ACC (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| SPRINT-Mal | – | – | – | – | – | |
| iLMS | – | 48 | 0.23 | 0.74 | ||
| RF-MaloSite | 67 | 71 | 62 | 0.34 | 0.74 | |
| DL-MaloSite (Transfer Learning) | 65 |
Comparison between SPRINT-Mal [18], iLMS [19], RF-MaloSite and DL-MaloSite based on an independent test set from H. sapiens. ACC: Accuracy; SN: Sensitivity; SP: Specificity; MCC: Matthews correlation coefficient.
| Method | Organism | ACC (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| SPRINT-Mal | 35 | 0.20 | 0.70 | |||
| iLMS | – | – | – | – | – | |
| RF-MaloSite | 62 | 66 | 59 | 0.24 | 0.66 | |
| DL-MaloSite (Transfer Learning) | 69 | 60 |
Comparison between SPRINT-Mal [18], iLMS [19], RF-MaloSite and DL-MaloSite based on an independent test set from S. erythraea. ACC: Accuracy; SN: Sensitivity; SP: Specificity; MCC: Matthews correlation coefficient.
| Method | Organism | ACC (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| SPRINT-Mal | 23 | 0.12 | 0.64 | |||
| iLMS | – | – | – | – | – | |
| RF-MaloSite | 65 | 77 | 53 | |||
| DL-MaloSite | 68 | 47 |
Comparison between DL-MaloSite, RF-MaloSite and LEMP based on 10-fold cross-validation. ACC: Accuracy; SN: Sensitivity; SP: Specificity; MCC: Matthews correlation coefficient.
| Method | ACC (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|
| LEMP | 42 | 0.24 | |||
| RF-MaloSite | 71 | 79 | 63 | 0.42 | 0.77 |
| DL-MaloSite | 75 | 69 | 0.81 |
Comparison between DL-MaloSite, RF-MaloSite and LEMP based on the independent test set. ACC: Accuracy; SN: Sensitivity; SP: Specificity; MCC: Matthews correlation coefficient.
| Method | ACC (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|
| LEMP | 44 | 0.24 | |||
| RF-MaloSite | 70 | 76 | 63 | 0.40 | 0.75 |
| DL-MaloSite | 74 | 68 | 0.81 |