Saw Simeon, Watshara Shoombuatong, Nuttapat Anuwongcharoen, Likit Preeyanon, Virapong Prachayasittikul, Jarl E S Wikberg, Chanin Nantasenamat.
Abstract
BACKGROUND: Monomeric fluorescent proteins (FPs) are currently ideal markers for protein tagging. Predicting oligomeric states is helpful for enhancing live biomedical imaging. Computational prediction of FP oligomeric states can accelerate protein engineering efforts to create monomeric FPs. To the best of our knowledge, this study presents the first computational model for predicting and analyzing FP oligomerization directly from the amino acid sequence.
Keywords: Data mining; FP; Fluorescent protein; GFP; Green fluorescent protein; Oligomeric state; Web server
Year: 2016 PMID: 28053671 PMCID: PMC5167684 DOI: 10.1186/s13321-016-0185-8
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Summary of existing studies for predicting oligomeric states from protein sequences
| Data set | Method | Internal set size | External set size | Sequence features | Source |
|---|---|---|---|---|---|
| SWISS-PROT (release 34) | DT | 1639 | N/A | PCP | [ |
| | SVM | 1639 | N/A | AAC, AC | [ |
| | FDOD | 1639 | N/A | QSO | [ |
| SWISS-PROT (release 34) after removing similar protein sequences | SVM | 1568 | N/A | QSO | [ |
| | SVM | 1568 | N/A | AAC, DPC, AACD | [ |
| | k-NN | 1568 | N/A | QSO | [ |
| | SVM | 1568 | 1283 | PseAAC | [ |
| SWISS-PROT (release 40) | DA | 3174 | 332 | PseAAC | [ |
| | SVM | 3174 | N/A | FS, MSE | [ |
| | NN | 3174 | 332 | PseAAC | [ |
| UniProtKB (release 15.6) | Probability | 5495 | N/A | AAC, DPC | [ |
| | Fuzzy k-NN | 5495 | N/A | PseAAC | [ |
| SWISS-PROT (release 55.3) | OET-k-NN | 6702 | N/A | FunD, PsePSSM | [ |
| | | 6702 | N/A | PseAAC, PCP | [ |
| FP data set | DT | 318 | 79 | AAC, DPC, TPC, AC, CTD, Ctriad, QSO, PseAAC | This study |
DT: decision tree (also used in combination with the discrete wavelet transform); FDOD: function of degree of disagreement; DA: discriminatory analysis; SVM: support vector machine; NN: neural network; k-NN: k-nearest neighbors; Fuzzy k-NN: fuzzy k-nearest neighbors; OET-k-NN: optimized evidence-theoretic k-NN algorithm; AAC: amino acid composition; AACD: amino acid composition distribution; AC: autocorrelation descriptors derived from several physicochemical properties, including Geary, Moreau-Broto and Moran; APseAAC: amphiphilic pseudo-amino acid composition; CTD: composition, transition and distribution; Ctriad: conjoint triad descriptors; DPC: dipeptide composition; FS: factor scores; FunD: functional domain composition; MSE: multi-scale energy; PCP: physicochemical properties; PseAAC: pseudo amino acid composition; PsePSSM: pseudo position-specific score matrix; TPC: tripeptide composition; QSO: quasi-sequence-order descriptors
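To make two of the sequence features listed above concrete, the following is a minimal sketch of amino acid composition (AAC) and dipeptide composition (DPC) under their usual definitions: the fraction of each of the 20 standard residues, and the fraction of each of the 400 adjacent residue pairs. The function names `aac` and `dpc` and the handling of non-standard residues are assumptions made for illustration; this is not the descriptor software used in the study.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    # Assumption: non-standard characters (e.g. X) are ignored.
    seq = [r for r in sequence.upper() if r in AMINO_ACIDS]
    total = len(seq)
    return {a: (seq.count(a) / total if total else 0.0) for a in AMINO_ACIDS}


def dpc(sequence):
    """Dipeptide composition: fraction of each adjacent pair (400 values)."""
    seq = sequence.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        # Assumption: pairs containing a non-standard residue are skipped.
        if pair in counts:
            counts[pair] += 1
            total += 1
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


if __name__ == "__main__":
    toy = "MSKGEELFTG"  # short toy fragment, for illustration only
    print(sorted(aac(toy).items(), key=lambda kv: -kv[1])[:3])
    print({k: v for k, v in dpc(toy).items() if v > 0})
```

Tripeptide composition (TPC) follows the same pattern with windows of three residues, giving 8000 values per sequence.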
Fig. 1 Workflow of QSPR modeling for predicting oligomeric states of FP
Summary of performance of QSAR models for predicting the oligomeric state of FPs (100% homologous sequence reduction) using the J48 algorithm
| Descriptors | Training set Ac (%) | Training set Sn (%) | Training set Sp (%) | Training set MCC | Tenfold CV set Ac (%) | Tenfold CV set Sn (%) | Tenfold CV set Sp (%) | Tenfold CV set MCC | External set Ac (%) | External set Sn (%) | External set Sp (%) | External set MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAC/DPC/TPC | 97.40 ± 0.77 | 97.72 ± 1.23 | 97.14 ± 1.31 | 0.95 ± 0.02 | 83.07 ± 2.04 | 83.26 ± 2.19 | 82.95 ± 2.27 | 0.66 ± 0.04 | 83.26 ± 3.58 | 83.77 ± 5.14 | 83.37 ± 4.45 | 0.67 ± 0.07 |
| AC | 98.70 ± 0.58 | 98.54 ± 0.93 | 98.86 ± 0.75 | 0.97 ± 0.01 | 78.36 ± 2.35 | 78.36 ± 2.36 | 78.67 ± 2.30 | 0.57 ± 0.04 | 78.49 ± 4.76 | 78.65 ± 5.66 | 78.89 ± 5.62 | 0.57 ± 0.10 |
| CTD | 97.58 ± 0.82 | 98.05 ± 0.99 | 97.20 ± 1.33 | 0.95 ± 0.02 | 80.28 ± 2.37 | 79.66 ± 2.98 | 80.88 ± 2.33 | 0.60 ± 0.05 | 80.40 ± 4.92 | 80.37 ± 6.63 | 81.01 ± 5.34 | 0.61 ± 0.10 |
| Ctriad | 95.46 ± 1.08 | 96.20 ± 1.51 | 94.85 ± 1.87 | 0.91 ± 0.02 | 80.38 ± 2.01 | 81.07 ± 2.43 | 79.82 ± 2.08 | 0.61 ± 0.04 | 81.06 ± 4.83 | 81.80 ± 5.79 | 80.98 ± 5.55 | 0.62 ± 0.10 |
| QSO | 98.35 ± 0.63 | 98.35 ± 0.83 | 98.36 ± 0.93 | 0.97 ± 0.01 | 80.15 ± 1.91 | 80.11 ± 2.25 | 80.27 ± 2.16 | 0.60 ± 0.04 | 81.42 ± 4.00 | 81.86 ± 5.21 | 81.58 ± 5.00 | 0.63 ± 0.08 |
| PseAAC | 98.51 ± 0.62 | 98.63 ± 0.85 | 98.42 ± 1.05 | 0.97 ± 0.01 | 81.13 ± 1.89 | 81.19 ± 2.24 | 81.14 ± 2.09 | 0.62 ± 0.04 | 81.40 ± 4.66 | 81.13 ± 5.48 | 82.19 ± 5.54 | 0.63 ± 0.09 |
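The column headings Ac, Sn, Sp and MCC are read here as accuracy, sensitivity, specificity and Matthews correlation coefficient; that expansion is an assumption, since the abbreviations are not defined in this record. Under that reading, a minimal sketch of how the four values follow from a binary confusion matrix is given below. The function name `classification_metrics` is hypothetical, and which oligomeric class is treated as "positive" is not stated in the source.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return Ac, Sn, Sp (in percent) and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    ac = 100.0 * (tp + tn) / total if total else 0.0
    sn = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    sp = 100.0 * tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Ac": ac, "Sn": sn, "Sp": sp, "MCC": mcc}


if __name__ == "__main__":
    # Made-up counts for illustration; not data from the study.
    print(classification_metrics(tp=40, tn=30, fp=10, fn=20))
```

Unlike accuracy, MCC ranges from -1 to 1 and stays informative when the two classes are unevenly sized, which may be why it is reported alongside the percentage metrics.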
Summary of performance of QSAR models for predicting the oligomeric state of FPs (99% homologous sequence reduction) using the J48 algorithm
| Descriptors | Training set Ac (%) | Training set Sn (%) | Training set Sp (%) | Training set MCC | Tenfold CV set Ac (%) | Tenfold CV set Sn (%) | Tenfold CV set Sp (%) | Tenfold CV set MCC | External set Ac (%) | External set Sn (%) | External set Sp (%) | External set MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAC/DPC/TPC | 98.22 ± 0.71 | 98.73 ± 0.89 | 97.69 ± 1.20 | 0.97 ± 0.01 | 79.40 ± 2.75 | 80.78 ± 1.86 | 77.98 ± 3.20 | 0.59 ± 0.06 | 80.78 ± 5.72 | 82.12 ± 6.28 | 80.12 ± 7.50 | 0.62 ± 0.12 |
| AC | 98.22 ± 0.71 | 98.73 ± 0.89 | 97.69 ± 1.20 | 0.96 ± 0.01 | 72.88 ± 3.46 | 74.52 ± 3.65 | 71.20 ± 3.74 | 0.46 ± 0.07 | 72.73 ± 6.11 | 74.89 ± 6.72 | 71.43 ± 7.57 | 0.46 ± 0.12 |
| CTD | 97.66 ± 0.90 | 98.06 ± 1.17 | 97.29 ± 1.46 | 0.95 ± 0.02 | 74.40 ± 3.03 | 75.01 ± 3.32 | 73.92 ± 3.42 | 0.49 ± 0.06 | 74.89 ± 5.79 | 75.76 ± 6.64 | 74.83 ± 6.95 | 0.50 ± 0.12 |
| Ctriad | 95.25 ± 1.87 | 96.62 ± 1.57 | 94.00 ± 3.64 | 0.91 ± 0.04 | 73.66 ± 2.84 | 75.78 ± 3.08 | 71.54 ± 3.15 | 0.47 ± 0.06 | 74.22 ± 6.02 | 77.26 ± 7.06 | 72.19 ± 7.21 | 0.49 ± 0.12 |
| QSO | 98.50 ± 0.64 | 98.63 ± 1.00 | 98.39 ± 1.15 | 0.97 ± 0.01 | 75.78 ± 2.71 | 77.37 ± 3.08 | 74.11 ± 2.74 | 0.51 ± 0.05 | 76.71 ± 5.27 | 78.92 ± 6.19 | 75.19 ± 6.35 | 0.54 ± 0.11 |
| PseAAC | 98.17 ± 0.74 | 98.38 ± 1.08 | 97.98 ± 1.36 | 0.96 ± 0.01 | 74.35 ± 3.00 | 75.87 ± 2.83 | 72.81 ± 3.67 | 0.49 ± 0.06 | 74.71 ± 5.74 | 77.17 ± 6.60 | 72.97 ± 6.93 | 0.50 ± 0.12 |
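The "Tenfold CV set" columns refer to tenfold cross-validation, in which the internal set is divided into ten folds and each fold serves once as the held-out part. The sketch below shows one plain way to generate such splits; the function name `k_fold_indices`, the shuffling seed and the unstratified assignment are assumptions for illustration, not the procedure used by the authors. The sample count of 318 is the internal set size listed for the FP data set in the summary table of existing studies.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for fold_no, (train, test) in enumerate(k_fold_indices(318), start=1):
        print(fold_no, len(train), len(test))
```

Each sample appears in exactly one held-out fold, so every sequence contributes once to the cross-validated estimate.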
Summary of performance of QSAR models for predicting the oligomeric state of FPs (95% homologous sequence reduction) using the J48 algorithm
| Descriptors | Training set Ac (%) | Training set Sn (%) | Training set Sp (%) | Training set MCC | Tenfold CV set Ac (%) | Tenfold CV set Sn (%) | Tenfold CV set Sp (%) | Tenfold CV set MCC | External set Ac (%) | External set Sn (%) | External set Sp (%) | External set MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAC/DPC/TPC | 97.54 ± 1.19 | 99.25 ± 0.90 | 94.85 ± 2.50 | 0.95 ± 0.03 | 72.13 ± 4.18 | 79.83 ± 3.66 | 61.03 ± 5.34 | 0.42 ± 0.09 | 72.89 ± 7.08 | 79.85 ± 6.92 | 64.16 ± 11.20 | 0.43 ± 0.15 |
| AC | 98.35 ± 0.87 | 99.31 ± 0.92 | 96.81 ± 1.95 | 0.97 ± 0.02 | 70.71 ± 4.45 | 77.73 ± 3.63 | 59.80 ± 6.10 | 0.38 ± 0.09 | 70.30 ± 8.55 | 77.40 ± 7.91 | 60.99 ± 13.19 | 0.38 ± 0.18 |
| CTD | 97.97 ± 1.06 | 98.33 ± 1.40 | 97.50 ± 1.95 | 0.96 ± 0.02 | 69.40 ± 4.95 | 75.24 ± 4.39 | 60.62 ± 6.33 | 0.39 ± 0.10 | 70.18 ± 7.79 | 75.54 ± 7.39 | 63.17 ± 12.39 | 0.38 ± 0.17 |
| Ctriad | 96.62 ± 1.33 | 98.07 ± 1.52 | 94.35 ± 2.89 | 0.93 ± 0.03 | 68.64 ± 5.99 | 76.28 ± 4.49 | 57.20 ± 8.12 | 0.34 ± 0.12 | 71.26 ± 8.36 | 78.04 ± 7.20 | 62.51 ± 12.24 | 0.40 ± 0.17 |
| QSO | 98.10 ± 1.08 | 98.55 ± 1.25 | 97.42 ± 2.33 | 0.96 ± 0.02 | 68.98 ± 4.21 | 76.15 ± 3.45 | 57.59 ± 5.63 | 0.34 ± 0.09 | 69.93 ± 6.90 | 77.19 ± 5.75 | 60.30 ± 11.15 | 0.37 ± 0.14 |
| PseAAC | 98.24 ± 0.92 | 98.38 ± 1.30 | 98.07 ± 1.71 | 0.96 ± 0.02 | 69.39 ± 4.97 | 76.34 ± 3.98 | 58.20 ± 6.67 | 0.35 ± 0.10 | 69.67 ± 8.03 | 76.92 ± 7.01 | 59.53 ± 10.36 | 0.36 ± 0.17 |
Fig. 2 Box plot of the feature usage from the predictive model of FP oligomerization. Features with the highest usage are deemed to be the most important
Fig. 3 Screenshot of the osFP web server. Shown are the web server before (a) and after (b) prediction