Runtao Yang, Chengjin Zhang, Rui Gao, Lina Zhang.
Abstract
The Golgi apparatus (GA) is a major collection and dispatch station for numerous proteins destined for secretion, plasma membranes and lysosomes. Dysfunction of GA proteins can result in neurodegenerative diseases. Accurate identification of protein sub-Golgi localization may therefore assist drug development and help clarify the mechanisms by which the GA participates in various cellular processes. In this paper, a new computational method is proposed for distinguishing cis-Golgi proteins from trans-Golgi proteins. Based on the concept of Common Spatial Patterns (CSP), a novel feature extraction technique is developed to extract evolutionary information from protein sequences. To deal with the imbalanced benchmark dataset, the Synthetic Minority Over-sampling Technique (SMOTE) is adopted. A feature selection method called Random Forest-Recursive Feature Elimination (RF-RFE) is employed to search for the optimal features among the CSP-based features and the g-gap dipeptide composition. Based on the optimal features, a Random Forest (RF) module is used to distinguish cis-Golgi proteins from trans-Golgi proteins. In jackknife cross-validation, the proposed method achieves a sensitivity of 0.889, a specificity of 0.880, an accuracy of 0.885, and a Matthews Correlation Coefficient (MCC) of 0.765, outperforming previous methods. When tested on a common independent dataset, the method also achieves improved performance. These results highlight the promise of the proposed method for identifying Golgi-resident protein types. Furthermore, the CSP-based feature extraction method may provide guidelines for protein function prediction.
Keywords: common spatial patterns; golgi apparatus proteins; random forest; recursive feature elimination; synthetic minority over-sampling technique
Year: 2016 PMID: 26861308 PMCID: PMC4783950 DOI: 10.3390/ijms17020218
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. The system architecture of the proposed method. PSSM: Position Specific Scoring Matrix, DC: Dipeptide Composition, ED: Evolutionary Difference, CSP: Common Spatial Patterns, SMOTE: Synthetic Minority Over-sampling Technique, RF: Random Forest, RFE: Recursive Feature Elimination.
Figure 2. Average amino acid frequencies of cis-Golgi and trans-Golgi proteins.
Figure 3. The performance of g-gap dipeptide composition features with various g values on the training dataset. Acc: Accuracy, AUC: Area Under the ROC Curve.
Detailed predictive results of the current method trained by g-gap dipeptide composition with different g.
| g | Sensitivity | Specificity | Accuracy | MCC | AUC |
|---|---|---|---|---|---|
| 0 | 0.714 | 0.908 | 0.811 | 0.634 | 0.848 |
| 1 | 0.724 | 0.894 | 0.809 | 0.627 | 0.854 |
| 2 | 0.724 | 0.899 | 0.811 | 0.632 | 0.847 |
| 3 | 0.733 | 0.926 | 0.829 | 0.672 | 0.858 |
| 4 | 0.705 | 0.899 | 0.802 | 0.615 | 0.836 |
| 5 | 0.700 | 0.903 | 0.802 | 0.616 | 0.848 |
| 6 | 0.710 | 0.922 | 0.816 | 0.646 | 0.856 |
| 7 | 0.700 | 0.889 | 0.795 | 0.601 | 0.844 |
| 8 | 0.705 | 0.935 | 0.820 | 0.658 | 0.844 |
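The g-gap dipeptide composition evaluated in the table above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `g_gap_dipeptide_composition`, the alphabetical ordering of the 400 residue pairs, and the choice to skip pairs containing non-standard residues (normalizing by the number of counted pairs) are all assumptions. With g = 0 it reduces to the ordinary dipeptide composition.

```python
from itertools import product

# The 20 standard amino acids; 20 x 20 ordered pairs give 400 features.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def g_gap_dipeptide_composition(sequence, g):
    """Return 400 frequencies of residue pairs separated by g residues.

    Assumption: pairs containing a non-standard residue are skipped and
    the frequencies are normalized by the number of counted pairs.
    """
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    # Residue i is paired with residue i + g + 1 (g residues in between).
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]


# Hypothetical toy sequence, used only to show the call.
features = g_gap_dipeptide_composition("MKTAYIAKQR", g=3)
print(len(features))
```

The 400-dimensional output is consistent with the feature counts reported later in this record (460 features for 3-gap DC combined with three 20-dimensional CSP feature sets).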
Prediction results of the CSP based feature extraction method and traditional feature extraction methods from evolutionary information.
| Method | Sensitivity | Specificity | Accuracy | MCC | AUC | Feature Number |
|---|---|---|---|---|---|---|
| PSSM-DC | 0.843 | 0.774 | 0.809 | 0.619 | 0.873 | 400 |
| CSP-PSSM-DC | 0.705 | 0.899 | 0.802 | 0.615 | 0.855 | 20 |
| Bi-gram PSSM | 0.710 | 0.922 | 0.816 | 0.646 | 0.909 | 400 |
| CSP-Bi-gram PSSM | 0.843 | 0.770 | 0.806 | 0.615 | 0.881 | 20 |
| ED-PSSM | 0.876 | 0.820 | 0.848 | 0.697 | 0.903 | 400 |
| CSP-ED-PSSM | 0.848 | 0.829 | 0.839 | 0.678 | 0.908 | 20 |
The performance of models trained with combined features.
| Training Feature | Sensitivity | Specificity | Accuracy | MCC | AUC |
|---|---|---|---|---|---|
| 3-gap DC+CSP-PSSM-DC | 0.853 | 0.816 | 0.834 | 0.669 | 0.887 |
| 3-gap DC+CSP-Bi-gram PSSM | 0.853 | 0.839 | 0.846 | 0.691 | 0.887 |
| 3-gap DC+CSP-ED-PSSM | 0.876 | 0.843 | 0.859 | 0.719 | 0.905 |
| 3-gap DC+CSP-PSSM-DC+CSP-Bi-gram PSSM | 0.857 | 0.793 | 0.825 | 0.651 | 0.882 |
| 3-gap DC+CSP-PSSM-DC+CSP-ED-PSSM | 0.862 | 0.843 | 0.853 | 0.705 | 0.899 |
| 3-gap DC+CSP-Bi-gram PSSM+CSP-ED-PSSM | 0.843 | 0.839 | 0.841 | 0.682 | 0.894 |
| 3-gap DC+CSP-PSSM-DC+CSP-Bi-gram PSSM+CSP-ED-PSSM | 0.876 | 0.853 | 0.864 | 0.728 | 0.912 |
Prediction results with and without SMOTE.
| Method | Sensitivity | Specificity | Accuracy | MCC |
|---|---|---|---|---|
| Without SMOTE | 0.184 | 0.949 | 0.730 | 0.048 |
| With SMOTE | 0.876 | 0.853 | 0.864 | 0.728 |
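The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as below. This is a simplified illustration under stated assumptions, not the procedure used in the paper: the function name `smote`, the random choice of seed samples, the default `k=5`, and the squared Euclidean distance are choices made here for brevity.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority-class feature vectors.

    Simplified sketch: each synthetic point lies on the line segment between
    a randomly chosen minority sample and one of its k nearest minority
    neighbours (squared Euclidean distance).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nn = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(x, nn)])
    return synthetic


# Hypothetical 2-D minority samples, used only to show the call.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(smote(minority, n_new=3, k=2))
```

Synthetic samples should be generated from training data only, so that evaluation samples do not influence the points the classifier is trained on.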
Figure 4. ROC (Receiver-Operating Characteristic) curves with and without SMOTE.
Figure 5. The accuracy against the dimension of top features. The maximum value is 0.901 when the first 55 features in the ranked feature list are selected.
Prediction results for Golgi-resident protein types using 3-gap DC+CSP-PSSM-DC+CSP-Bi-gram PSSM+CSP-ED-PSSM with and without feature selection.
| Method | Sensitivity | Specificity | Accuracy | MCC | Feature Number |
|---|---|---|---|---|---|
| Without feature selection | 0.876 | 0.853 | 0.864 | 0.728 | 460 |
| With feature selection | 0.908 | 0.894 | 0.901 | 0.802 | 55 |
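The elimination loop behind recursive feature elimination can be sketched generically: score the remaining features, drop the least important, and repeat until the target count is reached. The sketch below is an assumption-laden illustration, not the RF-RFE implementation from the paper. The names `recursive_feature_elimination` and `mean_difference_importance` are hypothetical, and the toy importance function merely stands in for the random forest importances that RF-RFE would recompute at every round.

```python
def recursive_feature_elimination(X, y, importance_fn, n_keep, step=1):
    """Drop the least important features until n_keep remain.

    importance_fn(X_subset, y) must return one score per column of X_subset.
    Returns (kept column indices, eliminated indices in removal order).
    """
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        subset = [[row[j] for j in remaining] for row in X]
        scores = importance_fn(subset, y)  # re-scored every round
        n_drop = min(step, len(remaining) - n_keep)
        order = sorted(range(len(remaining)), key=lambda i: scores[i])
        dropped = [remaining[i] for i in order[:n_drop]]
        eliminated.extend(dropped)
        remaining = [j for j in remaining if j not in dropped]
    return remaining, eliminated


def mean_difference_importance(X, y):
    """Toy stand-in for RF importances: |mean(class 1) - mean(class 0)|."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    return [
        abs(sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg))
        for j in range(len(X[0]))
    ]


# Hypothetical data: column 0 separates the classes, column 1 is constant.
X = [[0.0, 5.0, 1.0], [0.0, 5.0, 1.2], [1.0, 5.0, 1.5], [1.0, 5.0, 1.7]]
y = [0, 0, 1, 1]
print(recursive_feature_elimination(X, y, mean_difference_importance, n_keep=1))
```

Reading the eliminated list in reverse gives a ranking from most to least important, which is the kind of ranked feature list from which a top subset (55 features in the table above) can be chosen.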
Figure 6. ROC (Receiver-Operating Characteristic) curves with and without feature selection.
Performance comparisons with the existing methods on the training dataset by the jackknife cross validation, where better results are highlighted in bold.
| Reference | Sensitivity | Specificity | Accuracy | MCC | Feature Number |
|---|---|---|---|---|---|
| [ | 0.696 | 0.796 | 0.747 | 0.517 | 400 |
| [ | 0.738 | | 0.854 | 0.652 | 83 |
| This study | 0.889 | 0.880 | 0.885 | 0.765 | |
Performance comparison with the existing methods on the independent testing dataset.
| Reference | Sensitivity | Specificity | Accuracy | MCC | Feature Number |
|---|---|---|---|---|---|
| [ | 0.692 | 0.902 | 0.859 | 0.578 | 83 |
| This study | 0.923 | 0.941 | 0.938 | 0.821 | 55 |
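The four measures reported throughout these tables follow the standard confusion-matrix definitions. The sketch below shows how they are computed from raw counts; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper's datasets.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion counts.

    tp/fn: positive samples predicted correctly/incorrectly.
    tn/fp: negative samples predicted correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: MCC is 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, used only to show the call.
print(binary_metrics(tp=8, fn=2, tn=9, fp=1))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it more informative on imbalanced data such as the benchmark set described in the abstract.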