Jiri Hon, Martin Marusiak, Tomas Martinek, Antonin Kunka, Jaroslav Zendulka, David Bednar, Jiri Damborsky.
Abstract
MOTIVATION: Poor protein solubility hinders the production of many therapeutic and industrially useful proteins. Experimental efforts to increase solubility are plagued by low success rates and often reduce biological activity. Computational prediction of protein expressibility and solubility in Escherichia coli using only sequence information could reduce the cost of experimental studies by enabling prioritisation of highly soluble proteins.
Keywords: machine-learning; prediction; protein mining; protein solubility; soluble expression
Year: 2021 PMID: 33416864 PMCID: PMC8034534 DOI: 10.1093/bioinformatics/btaa1102
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Performance of various solubility predictors using the balanced SoluProt test set of 3100 sequences
| Method | AUC | T | ACC | MCC | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|
| SoluProt | 0.62 | 0.50 | 58.5% | 0.17 | 939 | 873 | 677 | 611 |
| PROSO II | 0.60 | 0.60 | 58.0% | 0.17 | 630 | 1167 | 383 | 920 |
| SWI | 0.60 | 0.50 | 55.9% | 0.13 | 1206 | 527 | 1023 | 344 |
| CamSol | 0.57 | 1.00 | 54.1% | 0.08 | 676 | 1001 | 549 | 874 |
| ESPRESSO | 0.56 | 0.50 | 53.8% | 0.08 | 1003 | 664 | 886 | 547 |
| rWH | 0.55 | 0.50 | 54.0% | 0.08 | 670 | 1005 | 545 | 880 |
| DeepSol | 0.55 | 0.50 | 52.9% | 0.09 | 230 | 1409 | 141 | 1320 |
| Protein-Sol | 0.54 | 0.45 | 51.6% | 0.03 | 1056 | 544 | 1006 | 494 |
| SOLpro | 0.53 | 0.50 | 52.0% | 0.04 | 654 | 959 | 591 | 896 |
| SKADE | 0.51 | 0.50 | 49.2% | –0.03 | 159 | 1366 | 184 | 1391 |
| ccSOL omics | 0.51 | 0.50 | 50.8% | 0.02 | 884 | 690 | 860 | 666 |
| RPSP | 0.50 | 0.50 | 49.8% | 0.00 | 501 | 1044 | 506 | 1049 |
Note: The different definitions of solubility and target expression system (Supplementary Table S1) should be considered when comparing the performance of individual tools.
AUC: area under the ROC curve; T: threshold for the soluble class; ACC: accuracy; MCC: Matthews correlation coefficient; TP: true positives; TN: true negatives; FP: false positives; FN: false negatives.
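The ACC and MCC columns follow directly from the confusion-matrix counts in each row. The sketch below uses the standard textbook definitions of accuracy and the Matthews correlation coefficient and checks them against the SoluProt row; the function names are illustrative and not taken from the paper.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that match the annotation."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is reported as 0 when a margin is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Counts copied from the SoluProt row of the table above.
soluprot = dict(tp=939, tn=873, fp=677, fn=611)
acc_soluprot = accuracy(**soluprot)
mcc_soluprot = mcc(**soluprot)

print(f"ACC = {acc_soluprot:.1%}, MCC = {mcc_soluprot:.2f}")
```

Because the test set is balanced (1550 soluble and 1550 insoluble sequences), an accuracy near 50% and an MCC near 0 both correspond to chance-level performance.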
Fig. 1. Receiver operating characteristic (ROC) curves calculated for the balanced SoluProt test set of 3100 sequences. The predictors are ordered by the area under the ROC curve (AUC)
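For readers who want to reproduce an AUC value from raw predictor scores, the area under the ROC curve equals the probability that a randomly chosen soluble sequence receives a higher score than a randomly chosen insoluble one, with ties counted as one half. The following is a minimal sketch of that pairwise definition on made-up scores; it is not the evaluation code used in the paper, and the toy labels and scores are assumptions for illustration only.

```python
def roc_auc(labels, scores):
    """AUC via the pairwise (Mann-Whitney) definition.

    labels: 1 for soluble, 0 for insoluble.
    scores: predicted solubility, higher means more soluble.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the SoluProt test set).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.3f}")
```

The quadratic loop is adequate for a test set of this size (1550 x 1550 pairs); a rank-based formula would be preferable for much larger sets.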
Overlaps between the SoluProt test set and available training sets
| Dataset | Size | Test set overlap | TP | TN | FP | FN |
|---|---|---|---|---|---|---|
| | 129643 | 2952 (95.2%) | 951 | 1437 | 50 | 514 |
| DeepSol/SKADE | 69420 | 2294 (74.0%) | 737 | 1130 | 67 | 360 |
| SWI | 12216 | 820 (26.5%) | 537 | 210 | 53 | 20 |
| SOLpro | 17408 | 480 (15.5%) | 178 | 120 | 39 | 143 |
Note: Two sequences were considered identical if their global sequence identity reported by USEARCH was 100%. Differences in solubility annotations for identical sequences were quantified using confusion matrix terms (TP, TN, FP and FN). The solubility annotations of the SoluProt test set are assumed to reflect the true solubilities of the proteins.
TP: true positives; TN: true negatives; FP: false positives; FN: false negatives. DeepSol and SKADE share the same training set.
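The overlap analysis described in the note can be sketched as follows. This is an illustration under simplifying assumptions: exact string equality stands in for the 100% global identity reported by USEARCH, each dataset is a hypothetical dictionary mapping a sequence to a boolean solubility annotation, and the toy sequences are invented.

```python
def overlap_confusion(test_set, training_set):
    """Compare annotations for sequences present in both sets.

    The test-set annotation is treated as the truth and the
    training-set annotation as the "prediction".
    Returns (number of shared sequences, confusion-matrix counts).
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for seq, truly_soluble in test_set.items():
        if seq not in training_set:
            continue  # sequence is not shared between the two sets
        annotated_soluble = training_set[seq]
        if annotated_soluble and truly_soluble:
            counts["TP"] += 1
        elif not annotated_soluble and not truly_soluble:
            counts["TN"] += 1
        elif annotated_soluble and not truly_soluble:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return sum(counts.values()), counts


# Toy data (hypothetical sequences and annotations).
toy_test = {"MKTAYIAK": True, "MSLLTEVE": False, "MGSSHHHH": True, "MADEEKLP": False}
toy_train = {"MKTAYIAK": True, "MSLLTEVE": True, "MGSSHHHH": False, "MNNQRKKT": True}

n_shared, toy_counts = overlap_confusion(toy_test, toy_train)
print(f"overlap: {n_shared} of {len(toy_test)} ({n_shared / len(toy_test):.1%})")
print(toy_counts)
```

In each row of the table, the four confusion-matrix counts sum to the overlap size (for example, 537 + 210 + 53 + 20 = 820 for SWI).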
Fig. 2. Increases in the number of true positives resulting from sequence prioritization using the tested solubility prediction tools. The SoluProt test set sequences were ordered by predicted solubility based on each predictor’s output, and a variable percentage of the sequences with the worst predicted solubility was then removed. The increase in the number of true positives was then calculated relative to a baseline random selection. For example, upon randomly removing 90% of the test set sequences (2790 samples), we would expect half of the remaining 310 sequences to be true positives
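The prioritization procedure in the caption can be sketched in a few lines. The caption does not state whether the increase is reported as a difference or a ratio, so the sketch returns the observed count, the random-selection expectation, and a relative increase; that last quantity is an assumption. The data are synthetic: a perfect predictor on a balanced set of 3100 sequences, which gives the upper bound rather than any result from the paper.

```python
def true_positive_gain(labels, scores, removed_fraction):
    """Keep the best-scoring sequences and compare with random selection.

    labels: 1 for soluble, 0 for insoluble.
    scores: predicted solubility, higher means more soluble.
    Returns (true positives kept, expected at random, relative increase).
    """
    n = len(labels)
    n_keep = n - round(n * removed_fraction)
    # Rank by predicted solubility, best first, and keep the top n_keep.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    kept_tp = sum(label for _, label in ranked[:n_keep])
    # Random removal preserves the soluble fraction on average.
    expected_random = n_keep * sum(labels) / n
    relative_increase = kept_tp / expected_random - 1 if expected_random else 0.0
    return kept_tp, expected_random, relative_increase


# Synthetic balanced set with a perfect predictor (illustration only).
toy_labels = [1] * 1550 + [0] * 1550
toy_scores = [float(label) for label in toy_labels]

kept_tp, expected_random, relative_increase = true_positive_gain(
    toy_labels, toy_scores, removed_fraction=0.9
)
print(kept_tp, expected_random, relative_increase)
```

With 90% of the sequences removed, 310 remain and random selection is expected to retain 155 soluble ones, matching the example in the caption.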