Zhibin Lv1, Shunshan Jin2, Hui Ding3, Quan Zou1,3.
Abstract
To gain insight into the malfunction of the Golgi apparatus and its relationship to various genetic and neurodegenerative diseases, the identification of sub-Entities:
Keywords: ANOVA feature selection; k-gap dipeptide; random forests; split amino acid composition; sub-Golgi protein classifier; synthetic minority over-sampling
Year: 2019 PMID: 31552241 PMCID: PMC6737778 DOI: 10.3389/fbioe.2019.00215
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
Figure 1. Modeling framework of the state-of-the-art random forests sub-Golgi protein classifier. ANOVA: analysis of variance.
Jackknife cross-validation and independent testing results after training on the benchmark data set D1 without feature selection.
| Feature type (dimension) | SMOTE | Jackknife Acc | Jackknife MCC | Jackknife Sn | Jackknife Sp | Independent Acc | Independent MCC | Independent Sn | Independent Sp |
|---|---|---|---|---|---|---|---|---|---|
| 2-gapDC(400) | N | 74.5% | 0.326 | 94.7% | 28.6% | 79.7% | 0.318 | 90.2% | 38.5% |
| SAAC(60) | N | 69.3% | 0.073 | 97.9% | 4.8% | 78.1% | −0.07 | 98.0% | 0.0% |
| 2-gapDC+SAAC(460) | N | 75.2% | 0.351 | 94.7% | 31.0% | 79.7% | 0.237 | 94.1% | 23.1% |
| 2-gapDC(400) | Y | 86.3% | 0.743 | 96.8% | 75.8% | 82.8% | 0.351 | 98.0% | 23.1% |
| SAAC(60) | Y | 87.9% | 0.763 | 93.7% | 82.1% | 81.2% | 0.388 | 90.2% | 46.2% |
| SAAC+2-gapDC(460) | Y | 90.5% | 0.817 | 96.8% | 84.2% | 81.2% | 0.287 | 96.1% | 23.1% |
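The jackknife protocol named in the table caption is commonly implemented as leave-one-out cross-validation: each sample is held out once, the model is trained on the rest, and the held-out sample is predicted. The following is a minimal sketch of that loop, not the authors' code. The nearest-centroid classifier is only a stand-in for the random forest, and the function names and toy data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def fit_centroids(X: Sequence[Sequence[float]], y: Sequence[int]) -> Dict[int, List[float]]:
    """Toy stand-in classifier: store the mean feature vector of each class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(model: Dict[int, List[float]], row: Sequence[float]) -> int:
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sqdist(centroid: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: sqdist(model[label]))


def jackknife_predictions(X: List[List[float]], y: List[int],
                          fit: Callable, predict: Callable) -> List[int]:
    """Leave-one-out loop: train on all samples but one, predict the held-out one."""
    predictions = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        predictions.append(predict(model, X[i]))
    return predictions


# Hypothetical toy data: two well-separated classes in two dimensions.
X_toy = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
y_toy = [0, 0, 0, 1, 1, 1]
loo_preds = jackknife_predictions(X_toy, y_toy, fit_centroids, predict_centroid)
loo_accuracy = sum(p == t for p, t in zip(loo_preds, y_toy)) / len(y_toy)
```

Because every sample is predicted by a model that never saw it, the resulting predictions can be scored like an ordinary test set; the cost is one model fit per sample.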
Figure 2. Jackknife cross-validation and independent testing accuracy of the random forest classifier as the number of features is varied: (A) 2-gap dipeptide composition (2-gapDC) features, (B) 59 selected 2-gapDC features + 60 split amino acid composition (SAAC) features, and (C) 55 selected 2-gapDC features + 60 SAAC features.
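Varying the number of features, as in this figure, presupposes a ranking of the features. The keywords name ANOVA feature selection, so one plausible reading is that features are ordered by their one-way ANOVA F-score and the top-ranked ones are kept. The sketch below illustrates that idea under this assumption; the function names and the toy matrix are hypothetical, and the source does not confirm the exact procedure.

```python
from typing import Dict, List, Sequence, Tuple


def anova_f_score(values: Sequence[float], labels: Sequence[int]) -> float:
    """One-way ANOVA F statistic of a single feature across class labels.

    Assumes at least two classes and more samples than classes.
    """
    groups: Dict[int, List[float]] = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2 for label, g in groups.items())
    ss_within = sum((v - means[label]) ** 2 for label, g in groups.items() for v in g)
    if ss_within == 0.0:
        # Degenerate case: no spread inside any class.
        return float("inf") if ss_between > 0.0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(X: Sequence[Sequence[float]], y: Sequence[int]) -> Tuple[List[int], List[float]]:
    """Return feature indices sorted by descending F-score, plus the raw scores."""
    n_features = len(X[0])
    scores = [anova_f_score([row[j] for row in X], y) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical toy matrix: feature 0 separates the classes strongly,
# feature 2 moderately, feature 1 not at all.
X_demo = [[1.0, 5.0, 2.0], [1.2, 3.0, 2.5], [0.8, 4.0, 1.5],
          [3.0, 4.0, 3.0], [3.2, 5.0, 2.5], [2.8, 3.0, 3.5]]
y_demo = [0, 0, 0, 1, 1, 1]
order, scores = rank_features(X_demo, y_demo)
top_two = order[:2]  # keep the two highest-ranked features
```

Sweeping the cut-off from one feature up to the full set and re-evaluating the classifier at each step would give curves of the kind the caption describes.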
The best evaluation scores from jackknife cross-validation and independent testing of different models with various feature types and feature numbers.
| Model | Feature type (number of features) | Jackknife Acc | Jackknife MCC | Jackknife Sn | Jackknife Sp | Independent Acc | Independent MCC | Independent Sn | Independent Sp |
|---|---|---|---|---|---|---|---|---|---|
| rfGPT_1 | 2-gapDC(59) | 93.2% | 0.864 | 94.7% | 91.6% | 84.4% | 0.466 | 94.1% | 46.2% |
| rfGPT_2 | 2-gapDC(55) | 91.1% | 0.823 | 94.7% | 87.4% | 89.1% | 0.631 | 98.0% | 53.8% |
| rfGPT_3 | 2-gapDC+SAAC(43) | 93.7% | 0.874 | 93.7% | 93.7% | 82.8% | 0.484 | 88.2% | 61.5% |
| rfGPT_4 | 2-gapDC+SAAC(93) | 90.5% | 0.811 | 92.6% | 88.4% | 90.6% | 0.696 | 96.1% | 69.2% |
| rfGPT_5 | 2-gapDC+SAAC(94) | 93.2% | 0.864 | 93.7% | 92.7% | 84.4% | 0.546 | 88.2% | 69.2% |
| rfGPT_6 | 2-gapDC+SAAC(66) | 90.0% | 0.800 | 89.5% | 90.5% | 89.1% | 0.695 | 90.2% | 84.6% |
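Assuming the four scores listed under each protocol are accuracy, Matthews correlation coefficient (MCC), sensitivity, and specificity, they can all be derived from a binary confusion matrix. The sketch below uses the standard definitions; the function name is hypothetical, and the counts in the example are illustrative values chosen for the demonstration, not numbers taken from the source.

```python
import math
from typing import Dict


def binary_scores(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Accuracy, MCC, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"acc": accuracy, "mcc": mcc, "sn": sensitivity, "sp": specificity}


# Illustrative counts (an assumption): 51 positives and 13 negatives.
demo = binary_scores(tp=49, fn=2, tn=9, fp=4)
```

With an imbalanced test set such as this illustrative one, accuracy and sensitivity can stay high while specificity is low, which is why MCC is usually read alongside them.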
Figure 3. Feature importance analysis of the random forests sub-Golgi classifier rfGPT_4: (A) importance of feature types, (B) the ranking order of the 93 features of rfGPT_4 and their integrated importance (red line), and (C) the importance of the top 25 features, which account for 50% of the integrated importance (blue line). A1A2.gap2 denotes the composition of the dipeptide A1A2, where A1 and A2 are each one of the 20 amino acid residues. Nterminal_D denotes the composition of amino acid residue D (aspartate) in the NH2-terminal segment of the protein sequence. The features interTier_K, interTier_W, and interTier_F denote the composition of the K (lysine), W (tryptophan), and F (phenylalanine) residues in the inter-tier segment between the NH2-terminal and COOH-terminal segments of the protein sequence.
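The feature names in this caption can be made concrete with a small sketch of how such descriptors are typically computed. It assumes that a k-gap dipeptide pairs two residues separated by k intervening positions, and that SAAC computes one 20-residue composition for each of three segments (which would give the 60 SAAC features). The terminal segment length `terminal_len`, the `Cterminal` prefix, and the normalization by the number of valid pairs are assumptions; the source does not state them.

```python
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_gap_dipeptide_composition(seq: str, k: int = 2) -> Dict[str, float]:
    """Frequency of residue pairs separated by k intervening residues (400 features)."""
    counts = {a + b: 0 for a in AMINO_ACIDS for b in AMINO_ACIDS}
    total = 0
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return {f"{pair}.gap{k}": (c / total if total else 0.0) for pair, c in counts.items()}


def amino_acid_composition(segment: str) -> Dict[str, float]:
    """Fraction of each of the 20 standard residues in a segment."""
    valid = [r for r in segment if r in AMINO_ACIDS]
    n = len(valid)
    return {a: (valid.count(a) / n if n else 0.0) for a in AMINO_ACIDS}


def split_amino_acid_composition(seq: str, terminal_len: int = 25) -> Dict[str, float]:
    """Composition of the N-terminal, inter-tier and C-terminal segments (60 features).

    terminal_len is an assumed parameter; sequences shorter than
    2 * terminal_len yield overlapping terminal segments and an empty inter-tier.
    """
    segments = {
        "Nterminal": seq[:terminal_len],
        "interTier": seq[terminal_len:len(seq) - terminal_len],
        "Cterminal": seq[-terminal_len:] if terminal_len else "",
    }
    features: Dict[str, float] = {}
    for prefix, segment in segments.items():
        for residue, fraction in amino_acid_composition(segment).items():
            features[f"{prefix}_{residue}"] = fraction
    return features


# Hypothetical 10-residue sequence with short terminal segments for illustration.
toy_seq = "MKWDAKFDWK"
gap_features = k_gap_dipeptide_composition(toy_seq, k=2)
saac_features = split_amino_acid_composition(toy_seq, terminal_len=3)
```

For the toy sequence, `gap_features["MD.gap2"]` is 1/7 because seven gapped pairs exist and M..D occurs once, and `saac_features["interTier_K"]` is 0.25 because the inter-tier segment is `DAKF`.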
Jackknife cross-validation and independent testing scores list for reported sub-Golgi protein classifiers.
| No. | Classifier (reference) | Data set | Feature type | Number of features | Jackknife Acc | Jackknife MCC | Jackknife Sn | Jackknife Sp | Independent Acc | Independent MCC | Independent Sn | Independent Sp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | IDMD (Ding et al., | D0 | 2-gapDC | 400 | 74.7% | 0.495 | 79.6% | 69.6% | / | / | / | / |
| 2 | SVM (Ding et al., | D1 | 2-gapDC | 83 | 85.4% | 0.652 | 90.5% | 90.5% | 85.9% | 0.578 | 90.2% | 69.2% |
| 3 | SVM (Jiao and Du, | D1 | PSPCP | 59 | 86.9% | 0.684 | 92.6% | 73.8% | / | / | 90.2% | 69.2% |
| 4 | SVM (Jiao and Du, | D1 | PSPCP | 49 | 91.2% | 0.793 | 99.0% | 73.8% | 87.1% | / | / | / |
| 5 | SVM (Lin et al., | D1 | TPDC | 501 | 97.1% | 0.949 | 100% | 92.9% | / | / | / | / |
| 6 | SVM (Rahman et al., | D2 | ACC +DPDC +TPDC +2-gapDC +PseAAC | 2800 | 95.9% | 0.920 | 95.9% | 92.6% | 93.8% | 0.85 | 98.0% | 84.6% |
| 7 | KNN (Ahmad et al., | D2 | PseAAC +3-gapDC +Bigram-PSSM | 83 | 94.9% | 0.90 | 97.2% | 92.6% | 94.8% | 0.86 | 93.9% | 94.0% |
| 8 | KNN (Ahmad and Hayat, | D2 | SAAC +PSSM +3-gapDC | 180 | 98.2% | 0.96 | 98.6% | 97.7% | 94% | 0.84 | 96.9% | 81.5% |
| 9 | RF (Yang R. et al., | D2 | 3-gapDC +CSP-PSSMDC +CSP-BigramPSSM +CSP-EDPSSM | 55 | 88.5% | 0.765 | 88.9% | 88% | 93.8% | 0.821 | 94.1% | 92.3% |
| 10 | RF (this work) | D1 | 2-gapDC+SAAC | 93 | 90.5% | 0.811 | 92.6% | 88.4% | 90.6% | 0.696 | 96.1% | 69.2% |