Hui-Ling Huang, Phasit Charoenkwan, Te-Fen Kao, Hua-Chin Lee, Fang-Lin Chang, Wen-Lin Huang, Shinn-Jang Ho, Li-Sun Shu, Wen-Liang Chen, Shinn-Ying Ho.
Abstract
BACKGROUND: Existing methods for predicting protein solubility on overexpression in Escherichia coli improve performance by using ensemble classifiers, such as two-stage support vector machine (SVM) based classifiers, and many feature types, such as physicochemical properties and amino acid and dipeptide composition, combined with feature selection. A simple and easily interpretable method for predicting protein solubility is desirable as an alternative to these complex SVM-based methods.
Year: 2012 PMID: 23282103 PMCID: PMC3521471 DOI: 10.1186/1471-2105-13-S17-S3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The system flowchart of the proposed scoring matrix method.
The performance of SCM with different pairs of weights on two data sets Sd957 and SOLproDB.
| Classifier (W1, W2) | Sd957 Training (%) | Sd957 Test (%) | Sd957 R | SOLproDB Training (%) | SOLproDB Test (%) | SOLproDB R |
|---|---|---|---|---|---|---|
| SCM (0.8, 0.2) | 83.52 | 82.72 | 0.981 | 59.99 | 58.99 | 1.000 |
| SCM (0.9, 0.1) | 84.47 | 84.29 | 0.953 | 59.99 | 58.99 | 1.000 |
| SCM (1.0, 0) | 84.99 | 87.43 | 0.682 | 63.85 | 60.00 | 0.776 |
R is a correlation coefficient.
10 independent runs of the scoring card method on Sd957.
| Run | Fitness | Training (%) | Test (%) | Sensitivity | Specificity | AUC | R | Threshold |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.905 | 83.159 | 81.675 | 0.640 | 0.879 | 0.893 | 0.954 | 463.746 |
| 2 | 0.903 | 83.420 | 83.246 | 0.740 | 0.865 | 0.894 | 0.940 | 473.867 |
| 3 | 0.906 | 83.681 | 85.864 | 0.760 | 0.894 | 0.897 | 0.943 | 461.951 |
| 4 | 0.906 | 84.465 | 84.293 | 0.740 | 0.879 | 0.894 | 0.953 | 463.787 |
| 5 | 0.900 | 82.507 | 83.246 | 0.780 | 0.851 | 0.883 | 0.968 | 455.756 |
| 6 | 0.900 | 83.943 | 86.911 | 0.820 | 0.887 | 0.884 | 0.962 | 457.543 |
| 7 | 0.902 | 83.943 | 84.293 | 0.720 | 0.887 | 0.888 | 0.959 | 464.317 |
| 8 | 0.899 | 83.159 | 83.770 | 0.680 | 0.894 | 0.885 | 0.957 | 464.933 |
| 9 | 0.901 | 82.507 | 83.246 | 0.760 | 0.858 | 0.885 | 0.966 | 460.600 |
| 10 | 0.902 | 82.898 | 83.246 | 0.720 | 0.872 | 0.885 | 0.970 | 457.866 |
| Mean | 0.902 | 83.368 | 83.979 | 0.736 | 0.877 | 0.889 | 0.957 | 462.437 |
| Std. Dev. | 0.003 | 0.646 | 1.485 | 0.051 | 0.015 | 0.005 | 0.010 | 5.132 |
Run 4, which has the best training accuracy, is used for further analysis.
Performance comparisons between SCM and SVM using the same dipeptide composition.
| Classifier | Sd957 Training (%) | Sd957 Test (%) | SOLproDB Training (%) | SOLproDB Test (%) |
|---|---|---|---|---|
| SCM | 84.47 | 84.29 | 59.99 | 58.99 |
| SVM | 85.38 | 84.29 | 65.35 | 62.49 |
Figure 2. Heat map of the optimized solubility scoring matrix of dipeptides.
Figure 3. The histogram of sequence solubility scores in the test data set: (a) statistical SSM without optimization; (b) optimized SSM.
Figure 4. The test accuracies for various sizes of uncertainty regions.
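The idea behind Figure 4 can be sketched as follows. This is a minimal illustration, assuming that an uncertainty region is a band of scores centred on the decision threshold and that sequences falling inside it are left unclassified; the function name `accuracy_outside_region`, the toy scores, and the toy labels are hypothetical and are not taken from the paper. Only the threshold 463.79 comes from the reported results.

```python
from typing import List, Tuple

def accuracy_outside_region(
    scored: List[Tuple[float, bool]],
    threshold: float,
    half_width: float,
) -> Tuple[float, int]:
    """Accuracy over sequences whose score lies outside threshold +/- half_width.

    `scored` holds (solubility score, is_soluble) pairs. Sequences inside the
    band are treated as "uncertain" and skipped (an assumption of this sketch).
    Returns (accuracy, number of sequences actually classified).
    """
    kept = [(s, y) for s, y in scored if abs(s - threshold) > half_width]
    if not kept:
        return float("nan"), 0
    # A sequence is predicted soluble when its score exceeds the threshold.
    correct = sum((s > threshold) == y for s, y in kept)
    return correct / len(kept), len(kept)

# Hypothetical (score, true label) pairs, for illustration only.
toy = [
    (500.0, True), (470.0, False), (460.0, True),
    (380.0, False), (440.0, False), (480.0, True),
]

acc0, n0 = accuracy_outside_region(toy, threshold=463.79, half_width=0.0)
acc10, n10 = accuracy_outside_region(toy, threshold=463.79, half_width=10.0)
print(f"no region:    accuracy={acc0:.3f} on {n0} sequences")
print(f"region +/-10: accuracy={acc10:.3f} on {n10} sequences")
```

Widening the band trades coverage for accuracy: fewer sequences receive a prediction, but those that do lie further from the threshold.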
The cross-performance comparison between SCM and SOLpro.
| Classifier | Sd957 | SOLproDB |
|---|---|---|
| SCM | 84.29% | 53.90% |
| SOLpro¹ | 49.21% | 74.15%² |
¹ The SOLpro classifier was downloaded from the SOLpro website (http://solpro.proteomics.ics.uci.edu/).
² The 10-fold cross-validation accuracy of SOLpro on SOLproDB is taken from the SOLpro publication [11].
The initial solubility scoring matrix of amino acids.
| Amino acid | Score | | KUMS000103 (%) |
|---|---|---|---|
| A-Ala | 494.3 | 1.39 | 14.1 |
| E-Glu | 445.6 | 1.35 | 8.8 |
| D-Asp | 386.2 | 0.89 | 5.7 |
| K-Lys | 358.5 | 1.11 | 7.7 |
| M-Met | 333.6 | 1.21 | 3.3 |
| L-Leu | 357.3 | 1.32 | 9.1 |
| F-Phe | 320.6 | 1.01 | 5 |
| V-Val | 362.8 | 0.89 | 5.9 |
| P-Pro | 319.5 | 0.5 | 0.7 |
| I-Ile | 360.4 | 1.04 | 7.1 |
| H-His | 317.9 | 0.92 | 2 |
| Q-Gln | 326.1 | 1.29 | 3.7 |
| R-Arg | 347.1 | 1.17 | 5.5 |
| T-Thr | 333.0 | 0.76 | 4.4 |
| N-Asn | 311.4 | 0.77 | 3.2 |
| Y-Tyr | 293.7 | 0.95 | 4.5 |
| S-Ser | 265.4 | 0.82 | 3.9 |
| G-Gly | 313.4 | 0.47 | 4.1 |
| C-Cys | 303.0 | 0.74 | 0.1 |
| W-Trp | 306.5 | 1.06 | 1.2 |
Figure 5. The correlation coefficient R = 0.51 between the optimized SSM of amino acids and the α-helical propensity.
Selected R values between some interesting physicochemical properties and the optimized SSM of amino acids.
| Property | Keyword | Count | R max | R avg | R min | R var |
|---|---|---|---|---|---|---|
| R = 0.76 | KUMS000103 Distribution of amino acid residues in the α-helices in thermophilic proteins | | | | | |
| R = 0.54 | PRAM900102 Relative frequency in α-helix | | | | | |
| R = -0.42 | MUNV940102 Free energy in α-helical region | | | | | |
| Hydrophilicity | Hydrophilic | 2 | 0.38 | 0.27 | 0.16 | 0.02 |
| R = 0.38 | HOPT810101 Hydrophilicity value | | | | | |
| R = 0.16 | KUHL950101 Hydrophilicity scale | | | | | |
| Hydrophobicity | Hydrophobic | 36 | 0.35 | -0.09 | -0.30 | 0.03 |
| R = 0.35 | LEVM760101 Hydrophobic parameter | | | | | |
| R = -0.13 | CIDH920103 Normalized hydrophobicity scales for α+β-proteins | | | | | |
| R = -0.30 | CASG920101 Hydrophobicity scale from native protein structures | | | | | |
| Turn | Turn | 26 | 0.38 | -0.22 | -0.57 | 0.06 |
| R = 0.38 | OOBM850102 Optimized propensity to form reverse turn | | | | | |
| R = -0.19 | PALJ810116 Normalized frequency of turn in α/β class | | | | | |
| R = -0.57 | ROBB760108 Information measure for turn | | | | | |
| Charge | Charge | 5 | 0.59 | -0.01 | -0.43 | 0.2 |
| R = 0.59 | FAUJ880112 Negative charge | | | | | |
| R = -0.07 | FAUJ880111 Positive charge | | | | | |
| R = -0.43 | CHAM830108 A parameter of charge transfer donor capability | | | | | |
| Thermophile | thermophile | 6 | 0.76 | 0.52 | 0.33 | 0.2 |
| R = 0.76 | KUMS000103 Distribution of amino acid residues in the α-helices in thermophilic proteins | | | | | |
| R = 0.56 | FUKS010109 Entire chain composition of amino acids in intracellular proteins of thermophiles (percent) | | | | | |
| R = 0.33 | FUKS010105 Interior composition of amino acids in intracellular proteins of thermophiles (percent) | | | | | |
The descriptions of the AAindex IDs are taken from the AAindex database [16].
Figure 6. The correlation coefficient R = 0.76 between the optimized SSM of amino acids and the property KUMS000103, the distribution of residues in the α-helices in thermophilic proteins.
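As an illustration of how such a correlation coefficient is obtained, the sketch below computes a Pearson R between the Score and KUMS000103 (%) columns of the amino acid table above. It assumes that R denotes the Pearson correlation coefficient, which the record does not state explicitly. Because that table is captioned as the initial scoring matrix, the value computed here need not equal the R = 0.76 reported for the optimized matrix.

```python
import math
from typing import Sequence

# (residue, solubility score, KUMS000103 %) copied from the amino acid table.
TABLE = [
    ("A", 494.3, 14.1), ("E", 445.6, 8.8), ("D", 386.2, 5.7), ("K", 358.5, 7.7),
    ("M", 333.6, 3.3), ("L", 357.3, 9.1), ("F", 320.6, 5.0), ("V", 362.8, 5.9),
    ("P", 319.5, 0.7), ("I", 360.4, 7.1), ("H", 317.9, 2.0), ("Q", 326.1, 3.7),
    ("R", 347.1, 5.5), ("T", 333.0, 4.4), ("N", 311.4, 3.2), ("Y", 293.7, 4.5),
    ("S", 265.4, 3.9), ("G", 313.4, 4.1), ("C", 303.0, 0.1), ("W", 306.5, 1.2),
]

def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

scores = [row[1] for row in TABLE]
kums = [row[2] for row in TABLE]
r = pearson_r(scores, kums)
print(f"Pearson R between score and KUMS000103: {r:.2f}")
```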
Figure 7. Distribution of dipeptide scores along the positions of two typical sequences. The protein 1FSZ_A (length 372) has a solubility score of 499.92 and is predicted to be soluble, whereas Q5FZH9 (length 352) has a score of 383.73 and is predicted to be insoluble; the threshold value is 463.79.
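The prediction step described in the Figure 7 caption can be sketched as below. This is a minimal illustration, assuming that a sequence's solubility score is the mean of the scores of its overlapping dipeptides and that a score above the threshold means "soluble"; the record itself does not spell out the formula. The function names and the toy dipeptide scores are hypothetical. Only the threshold and the two reported sequence scores come from the caption.

```python
from typing import Dict

THRESHOLD = 463.79  # threshold value reported in the Figure 7 caption

def sequence_score(seq: str, dipeptide_scores: Dict[str, float]) -> float:
    """Mean score of all overlapping dipeptides in `seq` (assumed formula)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        raise ValueError("sequence must contain at least two residues")
    return sum(dipeptide_scores[p] for p in pairs) / len(pairs)

def classify(score: float, threshold: float = THRESHOLD) -> str:
    """Label a sequence by comparing its score with the threshold."""
    return "soluble" if score > threshold else "insoluble"

# Hypothetical dipeptide scores, for illustration only.
toy_scores = {"AE": 600.0, "EA": 500.0, "EG": 300.0, "GS": 200.0}
toy = sequence_score("AEAEGS", toy_scores)
print(f"toy sequence: score={toy:.2f} -> {classify(toy)}")

# The two sequence scores reported in the caption.
print("1FSZ_A:", classify(499.92))
print("Q5FZH9:", classify(383.73))
```

Keeping the per-dipeptide scores separate from the final average is what makes position-wise plots like Figure 7 possible: each position contributes a visible, interpretable term to the total.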