| Literature DB >> 33367627 |
Zhibin Lv1, Pingping Wang2, Quan Zou1,3,4, Qinghua Jiang2.
Abstract
MOTIVATION: The Golgi apparatus has a key functional role in protein biosynthesis within the eukaryotic cell with malfunction resulting in various neurodegenerative diseases. For a better understanding of the Golgi apparatus, it is essential to identification of sub-Golgi protein localization. Although some machine learning methods have been used to identify sub-Golgi localization proteins by sequence representation fusion, more accurate sub-Golgi protein identification is still challenging by existing methodology.Entities:
Year: 2020 PMID: 33367627 PMCID: PMC8023683 DOI: 10.1093/bioinformatics/btaa1074
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Modeling overview. The Golgi protein sequence is first converted into 1900-D features using the deep representation learning model UniRep. These 1900-D features are fed into ten classifiers; alternatively, the 1900-D feature vectors are reduced to 250-dimension vectors by LGBM feature selection, which are then fed into the ten classifiers with or without SMOTE. In the next step, the top two classifiers are selected for further optimization with LGBM, ANOVA and MRMD feature selection. Finally, the optimal model (SVM) is used in the isGP-DRLF webserver
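The pipeline in Figure 1 (embed, select the top features by LGBM importance, classify with an SVM) can be sketched as below. This is a minimal illustration, not the authors' code: random vectors stand in for real UniRep embeddings, and scikit-learn's `GradientBoostingClassifier` stands in for LightGBM as the importance-ranking model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical stand-in data: 120 sequences x 1900-D "UniRep" features,
# with binary cis-/trans-Golgi labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 1900))
y = rng.integers(0, 2, size=120)

# Rank features by tree-ensemble importance and keep the top 250,
# then fit an SVM on the reduced representation.
selector = SelectFromModel(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    max_features=250, threshold=-np.inf)
model = Pipeline([("select", selector), ("svm", SVC())])
model.fit(X, y)
print(model.named_steps["select"].transform(X).shape)  # (120, 250)
```

In the paper the 250-feature cut is itself tuned further (Fig. 3); here the count is fixed only to show the reduction step.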
Fig. 2. Boxplots of 10-fold cross-validation accuracy and ROC curves for ten classifiers (LR: Logistic Regression, KNN: K-Nearest Neighbors, DT: Decision Tree, NB: Gaussian Naive Bayes, Bagging: Bagging, RF: Random Forest, AB: AdaBoost, LGBM: Light Gradient Boosting Machine, SVM: Support Vector Machine, LDA: Linear Discriminant Analysis) using different feature processing technologies. (A) and (B) use the 1900-dimension UniRep feature vectors; (C) and (D) use SMOTE to balance the 1900-dimension UniRep feature vectors; in (E) and (F), building on the previous steps, 250 features are selected with the LGBM feature selection method. Green triangles and orange lines in (A), (C) and (E) mark the mean and median accuracy over the 10-fold cross-validation. In every case, the SVM classifier had the highest average accuracy (77.32%, 90.31% and 90.76%, respectively) and the highest average auROC (0.765, 0.940 and 0.958, respectively)
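Panels C-F rely on SMOTE to balance the minority class before training. The core idea can be sketched in a few lines of NumPy; note this is a simplified stand-in for the imbalanced-learn library the field typically uses, and every name below (`smote_like`, the data shapes) is illustrative only: each synthetic minority sample is an interpolation between a minority point and one of its k nearest minority neighbours.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from minority-class rows X_min
    by interpolating towards randomly chosen near neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]      # k nearest minority neighbours
        j = rng.choice(nbrs)
        lam = rng.random()                 # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# Hypothetical minority class: 20 sequences with 1900-D embeddings.
X_min = np.random.default_rng(1).normal(size=(20, 1900))
synthetic = smote_like(X_min, n_new=30)
print(synthetic.shape)  # (30, 1900)
```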
Fig. 3. (A) On benchmark dataset D3, the average 10-fold cross-validation accuracy as a function of feature number for the LGBM and SVM classifiers under ANOVA, MRMD and LGBM feature selection. The best SVM reached an accuracy of 92.16% with 158 features; the best LGBM classifier reached 93.08% with 64 features; both used LGBM feature selection. (B) Ten-fold cross-validation and LOO metrics comparing the best SVM (based on benchmark datasets D3 and D5) and the best LGBM classifier (based on benchmark dataset D3). (C) Independent-test metrics on benchmark testing dataset D4 for the best SVM and LGBM classifiers obtained by LOO on benchmark datasets D3 and D5
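The sweep in panel A (accuracy versus number of retained features) is a standard model-selection loop. A minimal sketch with the ANOVA branch, using scikit-learn's `SelectKBest(f_classif)` and random stand-in data rather than the D3 embeddings:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical data in place of the D3 benchmark embeddings.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 300))
y = rng.integers(0, 2, size=100)

# Score each candidate feature count k by mean 10-fold CV accuracy;
# the selector is refit inside each fold to avoid selection leakage.
scores = {}
for k in (16, 64, 158):
    pipe = Pipeline([("anova", SelectKBest(f_classif, k=k)), ("svm", SVC())])
    scores[k] = cross_val_score(pipe, X, y, cv=10).mean()
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Putting the selector inside the pipeline matters: selecting features on the full dataset before cross-validating would leak test information into the accuracy curve.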
Evaluation metrics comparisons of support vector machine classifiers based on different state-of-the-art deep representation learning features
| Feature type | Trained dataset (identity) | Feature dimensions | LOO ACC (%) | LOO MCC | LOO Sn (%) | LOO Sp (%) | LOO auROC | Test ACC (%) | Test MCC | Test Sn (%) | Test Sp (%) | Test auROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UniRep | D3 (40%) | 158 | 92.6 | 0.85 | 94.9 | 90.3 | 0.964 | 98.4 | 0.95 | 100 | 98.0 | 0.995 |
| UniRep | D5 (25%) | 107 | 99.2 | 0.98 | 100 | 98.4 | 0.999 | 96.4 | 0.90 | 100 | 84.6 | 0.994 |
| BiLSTM-lm | D3 (40%) | 77 | 88.7 | 0.78 | 85.2 | 92.1 | 0.917 | 92.1 | 0.75 | 96.1 | 76.9 | 0.983 |
| BiLSTM-lm | D5 (25%) | 152 | 99.4 | 0.99 | 99.9 | 98.9 | 0.999 | 92.2 | 0.75 | 100 | 61.5 | 0.989 |
| BiLSTM-ssa | D3 (40%) | 93 | 91.7 | 0.83 | 90.7 | 92.6 | 0.946 | 90.6 | 0.71 | 94.1 | 76.9 | 0.975 |
| BiLSTM-ssa | D5 (25%) | 48 | 99.8 | 0.99 | 100 | 99.7 | 0.999 | 87.5 | 0.58 | 100 | 38.5 | 0.956 |
| TAPE-pooled | D3 (40%) | 77 | 90.3 | 0.81 | 89.9 | 90.7 | 0.941 | 90.6 | 0.70 | 96.0 | 69.2 | 0.966 |
| TAPE-pooled | D5 (25%) | 53 | 98.7 | 0.97 | 100 | 97.5 | 0.999 | 90.6 | 0.69 | 98.0 | 61.5 | 0.927 |
| TAPE-avg | D3 (40%) | 67 | 91.9 | 0.84 | 94.0 | 89.9 | 0.963 | 96.4 | 0.91 | 100 | 96.1 | 0.985 |
| TAPE-avg | D5 (25%) | 73 | 99.7 | 0.99 | 100 | 99.3 | 0.999 | 89.1 | 0.64 | 100 | 46.1 | 0.989 |
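The tables report ACC, MCC, Sn (sensitivity), Sp (specificity) and auROC. As a reference for how these five metrics relate to a confusion matrix, here is a small worked example on toy labels and scores (not the paper's data), using scikit-learn:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, matthews_corrcoef,
                             confusion_matrix, roc_auc_score)

# Toy labels/scores purely to illustrate the metric definitions.
y_true  = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1])
y_pred  = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = accuracy_score(y_true, y_pred)   # (TP + TN) / (TP + TN + FP + FN)
mcc = matthews_corrcoef(y_true, y_pred)
sn  = tp / (tp + fn)                   # sensitivity: recall on positives
sp  = tn / (tn + fp)                   # specificity: recall on negatives
auroc = roc_auc_score(y_true, y_score) # threshold-free ranking quality
print(acc, round(mcc, 3), sn, sp, auroc)  # 0.75 0.5 0.75 0.75 0.9375
```

Note that ACC, MCC, Sn and Sp depend on the 0.5 decision threshold, while auROC is computed from the raw scores, which is why a model can pair a modest Sp with a high auROC, as in several D5 rows above.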
Evaluation metrics comparisons of the state-of-the-art classifiers
| Classifier | Trained dataset | Feature type numbers | Feature dimensions | LOO ACC (%) | LOO MCC | LOO Sn (%) | LOO Sp (%) | Test ACC (%) | Test MCC | Test Sn (%) | Test Sp (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SVM (this study) | D3 | 1 | 158 | 92.6 | 0.85 | 94.9 | 90.3 | 98.4 | 0.95 | 100 | 98.0 |
| SVM (this study) | D5 | 1 | 107 | 99.2 | 0.98 | 100 | 98.4 | 96.4 | 0.90 | 100 | 84.6 |
| KNN ( | D3 | 3 | 83 | 94.9 | 0.90 | 97.2 | 92.6 | 94.8 | 0.86 | 94.0 | 93.9 |
| KNN ( | D3 | 3 | 180 | 98.2 | 0.96 | 98.6 | 97.7 | 94.0 | 0.84 | 81.5 | 96.9 |
| RF ( | D3 | 4 | 55 | 88.5 | 0.68 | 88.9 | 88.0 | 93.8 | 0.82 | 92.3 | 94.1 |
| SVM ( | D3 | 6 | 2800 | 95.9 | 0.92 | 95.9 | 92.6 | 95.3 | 0.85 | 84.6 | 98.0 |
Fig. 4. Human sub-Golgi proteome sequence distribution and the results of isGP-DRLF and suGolgi2 tested on the human sub-Golgi proteome dataset