Catherine Ching Han Chang, Chen Li, Geoffrey I Webb, BengTi Tey, Jiangning Song, Ramakrishnan Nagasundara Ramanan.
Abstract
Periplasmic expression of soluble proteins in Escherichia coli not only greatly simplifies downstream purification, but also increases the probability of obtaining correctly folded, biologically active proteins. Different combinations of signal peptides and target proteins lead to different soluble protein expression levels, ranging from negligible to several grams per litre. Accurate algorithms for the rational selection of promising candidates can serve as a powerful complement to current trial-and-error approaches, allowing proteomics studies to be conducted with greater efficiency and cost-effectiveness. Here, we developed a predictor with a two-stage architecture to predict the real-valued expression level of a target protein in the periplasm. The output of the first-stage support vector machine (SVM) classifier determines which second-stage support vector regression (SVR) model is used. When tested on an independent test dataset, the predictor achieved an overall prediction accuracy of 78% and a Pearson's correlation coefficient (PCC) of 0.77. We further illustrate the relative importance of various features with respect to the different models. The results indicate that the occurrence of the glutamine-aspartic acid dipeptide (QD) is the most important feature for the classification model. Finally, we provide access to the implemented predictor through the Periscope webserver, freely accessible at http://lightning.med.monash.edu/periscope/.
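The two-stage architecture described in the abstract can be sketched as a simple dispatch: a classifier picks the expression class, and the regressor for that class produces the real-valued yield. This is a minimal illustration only; `two_stage_predict`, `toy_classifier` and `toy_regressors` are hypothetical stand-ins, not the trained SVM and SVR models behind Periscope.

```python
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[float]


def two_stage_predict(
    features: Features,
    classify: Callable[[Features], str],
    regressors: Dict[str, Callable[[Features], float]],
) -> Tuple[str, float]:
    """Return (expression class, predicted yield) for one feature vector."""
    level = classify(features)              # stage 1: Low / Medium / High
    predicted = regressors[level](features)  # stage 2: class-specific regressor
    return level, predicted


# Hypothetical stand-ins for the trained first-stage SVM and second-stage SVRs.
def toy_classifier(x: Features) -> str:
    if x[0] > 0.8:
        return "High"
    return "Medium" if x[0] > 0.2 else "Low"


toy_regressors = {
    "Low": lambda x: 0.1 * x[0],
    "Medium": lambda x: 50.0 * x[0],
    "High": lambda x: 1000.0 * x[0],
}

print(two_stage_predict([0.5], toy_classifier, toy_regressors))
```

One consequence of this design is that a misclassification in the first stage routes the input to a regressor trained on a different yield range, so first-stage errors can dominate the error in the final yield estimate.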
Year: 2016 PMID: 26931649 PMCID: PMC4773868 DOI: 10.1038/srep21844
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Overview of the Periscope development flowchart.
Performance comparison of primary classifiers developed using different machine learning algorithms.
| Algorithm | LIBSVM | RBFNetwork | RF |
|---|---|---|---|
| Average accuracy | 0.7647 | 0.7727 | |
| Error rate | 0.2353 | 0.2273 | |
| Precision | 0.6098 | 0.6085 | |
| Recall | 0.6218 | 0.5390 | |
| F1 score | 0.6297 | 0.5717 | |
| MCC | 0.4371 | 0.3623 | |
Performance was evaluated using 10-fold cross-validation (CV) repeated 10 times. Within each performance measure, the highest score is indicated in bold, except for error rate, where the lowest score is indicated in bold.
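For reference, the measures in the table above have standard definitions in terms of confusion-matrix counts. The sketch below uses the binary (one class versus the rest) forms; how the scores were averaged across the three expression classes is not stated here, so treating the problem as binary is an assumption, and the counts are hypothetical.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary classification measures from confusion-matrix counts.

    Assumes every denominator is non-zero (no empty row or column).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
example = binary_metrics(tp=6, fp=2, fn=4, tn=8)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```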
Performance of second-stage regression models.
| Regression model | PCC | MAE | RMSE |
|---|---|---|---|
| Low | 0.6934 | 0.0728 | 0.0845 |
| Medium | 0.5386 | 9.81 | 16.91 |
| High | 0.9381 | 425.81 | 593.54 |
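The three regression measures reported above (PCC, MAE and RMSE) can be computed as in the following sketch. The input values are hypothetical and are not taken from the study's data.

```python
import math
from typing import Sequence


def pcc(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Pearson's correlation coefficient between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Hypothetical observed and predicted values.
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
print(pcc(observed, predicted), mae(observed, predicted), rmse(observed, predicted))
```

MAE and RMSE are expressed in the units of the predicted quantity, so they are not directly comparable across models whose targets span very different ranges, whereas PCC is scale-free.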
Feature subsets selected for both classification and regression tasks and the description of respective features.
| Task | Selected features | Description |
|---|---|---|
| CLASSIFICATION | BPC | Occurrence frequency of |
| | Sulfur | Occurrence frequency of |
| | MCBPC | |
| | logPFR | |
| | CL | Occurrence of dipeptide cysteine and leucine |
| | QD | Occurrence of dipeptide glutamine and aspartic acid |
| | VE | Occurrence of dipeptide valine and glutamic acid |
| REGRESSION HIGH | TP | Occurrence of dipeptide threonine and proline |
| | VT | Occurrence of dipeptide valine and threonine |
| | T × MCPhe | Occurrence frequency of threonine interacting with |
| REGRESSION MEDIUM | ER | Occurrence of dipeptide glutamic acid and arginine |
| | WQ | Occurrence of dipeptide tryptophan and glutamine |
| | VT | Occurrence of dipeptide valine and threonine |
| | R × AbsCharge | Occurrence frequency of arginine interacting with |
| | ANC × MCAliphatic | Occurrence frequency of |
| | MCCys × pI | |
| REGRESSION LOW | F × logPFR | Occurrence frequency of phenylalanine interacting with |
| | S × MCNPH | Occurrence frequency of serine interacting with |
| | y × transmembrane | Occurrence of tyrosine interacting with occurrence of transmembrane, predicted using TMHMM |
| | Y × nlogPFR | Occurrence frequency of tyrosine interacting with |
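The dipeptide features in the table above (for example QD, VE or CL) count how often an ordered pair of residues occurs in a sequence. A minimal sketch follows; counting every overlapping window and normalising by the number of windows are assumptions about the feature definition, and the sequence is hypothetical.

```python
def dipeptide_count(sequence: str, dipeptide: str) -> int:
    """Count occurrences of an ordered residue pair, scanning every window of length 2."""
    seq = sequence.upper()
    pair = dipeptide.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == pair)


def dipeptide_frequency(sequence: str, dipeptide: str) -> float:
    """Occurrence count divided by the number of length-2 windows (assumed normalisation)."""
    windows = len(sequence) - 1
    return dipeptide_count(sequence, dipeptide) / windows if windows > 0 else 0.0


# Hypothetical amino acid sequence, for illustration only.
toy_sequence = "MKQDAVEQDCL"
for pair in ("QD", "VE", "CL"):
    print(pair, dipeptide_count(toy_sequence, pair), dipeptide_frequency(toy_sequence, pair))
```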
Relative significance of features selected (descending order) for primary classification task.
| Feature removed | Accuracy (% change) | Precision (% change) | Recall (% change) | F1 score (% change) | MCC (% change) |
|---|---|---|---|---|---|
| QD | −9.25 | −49.94 | −52.85 | −43.20 | −81.10 |
| BPC | −5.28 | −25.94 | −37.22 | −20.52 | −47.77 |
| Sulfur | −3.66 | −10.74 | −30.53 | −8.62 | −31.09 |
| VE | −3.35 | −6.15 | −29.40 | −5.76 | −26.70 |
| MCBPC | −2.13 | −12.07 | −30.18 | −8.97 | −26.28 |
| logPFR | −1.12 | −2.26 | −24.46 | −0.33 | −14.81 |
| CL | 1.02 | 5.38 | −19.80 | 6.52 | −2.85 |
Percentage changes were evaluated using 10-fold CV repeated 10 times, relative to the classification model trained on all seven selected features.
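The feature-removal analysis above can be sketched as a leave-one-feature-out loop: score the model on the full feature set, rescore it with each feature removed in turn, and report the change relative to the full model. The `toy_evaluate` function and its weights are hypothetical; in practice the evaluation step would retrain and cross-validate the model.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def percentage_change(full_score: float, reduced_score: float) -> float:
    """Change of the reduced model's score relative to the full model, in percent."""
    return 100.0 * (reduced_score - full_score) / full_score


def ablation(
    features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> List[Tuple[str, float]]:
    """Remove each feature in turn and rank features by the resulting score change."""
    baseline = evaluate(features)
    changes: Dict[str, float] = {}
    for removed in features:
        reduced = [f for f in features if f != removed]
        changes[removed] = percentage_change(baseline, evaluate(reduced))
    # Most negative change first, i.e. most important feature first.
    return sorted(changes.items(), key=lambda item: item[1])


# Hypothetical scoring function standing in for retraining plus cross-validation.
TOY_WEIGHTS = {"QD": 0.10, "BPC": 0.05, "CL": -0.01}


def toy_evaluate(feats: Sequence[str]) -> float:
    return 0.5 + sum(TOY_WEIGHTS[f] for f in feats)


ranking = ablation(["BPC", "CL", "QD"], toy_evaluate)
for name, change in ranking:
    print(f"{name}: {change:+.2f}%")
```

A positive change for a score such as accuracy means the model did slightly better without that feature, as seen for CL in the accuracy, precision and F1 columns above.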
Relative significance of features selected (descending order) for second-level regression models.
| Regression model | Feature removed | PCC (% change) | MAE (% change) | RMSE (% change) |
|---|---|---|---|---|
| High | T × MCPhe | −3.96 | 39.83 | 36.49 |
| | TP | −2.32 | 24.85 | 14.90 |
| | VT | −1.98 | 22.43 | 26.52 |
| Medium | WQ | −52.35 | 55.83 | 40.49 |
| | ANC × MCAliphatic | −47.28 | 35.51 | 27.50 |
| | ER | −35.52 | 26.77 | 16.13 |
| | R × AbsCharge | −15.74 | 6.28 | 7.41 |
| | VT | −2.04 | 13.28 | 3.71 |
| | MCCys × pI | −4.90 | 6.88 | 3.49 |
| Low | Y × nlogPFR | −36.06 | 15.54 | 17.17 |
| | S × MCNPH | −2.54 | 11.31 | 5.08 |
| | y × transmembrane | 6.02 | 5.60 | 0.76 |
| | F × logPFR | 2.90 | 0.36 | −2.14 |
Percentage changes were evaluated using leave-one-out cross-validation (LOOCV), relative to the high, medium and low regression models trained on their respective feature subsets.
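In LOOCV, each sample is held out once, the model is fitted to the remaining samples, and the held-out sample is predicted. The sketch below illustrates the procedure with a one-variable ordinary least-squares line in place of the SVR models used in the study; that substitution and the toy data are assumptions made to keep the example self-contained.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loocv_predictions(xs: List[float], ys: List[float]) -> List[float]:
    """Predict each sample with a model fitted to all the other samples."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]  # leave sample i out
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * xs[i] + intercept)
    return predictions


# Hypothetical, perfectly linear toy data.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [2.0, 4.0, 6.0, 8.0]
print(loocv_predictions(toy_x, toy_y))
```

The held-out predictions can then be scored with PCC, MAE and RMSE, so every sample contributes to the estimate without ever being used to fit the model that predicts it.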
Experimental and predicted expression data of independent test dataset.
| Protein | Signal peptide | Experimental expression level | Experimental yield (mg/l) | Predicted expression level | Predicted class probabilities [High, Low, Medium] | Predicted yield (mg/l) |
|---|---|---|---|---|---|---|
| VHHs B5.2 | ompA | Medium | 6.7 | Medium | 0.09,0.15,0.75 | 6.8009 |
| scFv13.R4 | TorA | Low | 0.06 | Medium | 0.09,0.27,0.64 | 4.6039 |
| Human protein disulfide isomerase (hPDI) | modified from ompA | Medium | 30 | Medium | 0.10,0.18,0.72 | 29.8987 |
| Granulocyte-macrophage colony-stimulating factor (GM-CSF) | CSP | High | 800 | Medium | 0.06,0.33,0.61 | 14.7816 |
| VHHs A5.1 | ompA | Medium | 55.5 | Medium | 0.05,0.20,0.75 | 5.3871 |
| Maltose-binding protein (MBP) | native | Medium | 9.8 | Medium | 0.39,0.12,0.49 | 11.6017 |
| human epidermal growth factor (hEGF) | phoA | Medium | 1.026 | Medium | 0.25,0.35,0.40 | 11.2143 |
| enzymatically active version of tissue plasminogen activator (vtPA) | stII | Low | 0.159 | Medium | 0.08,0.44,0.48 | 9.2542 |
| Cellulose binding domain (CBD) | Cex | High | 5310 | Medium | 0.16,0.30,0.54 | 2.013 |
| Exotoxin A from | ompA | Medium | 60 | Medium | 0.39,0.21,0.40 | 45.1823 |
| VHHs A19.2 | ompA | Medium | 3.8 | Medium | 0.06,0.13,0.81 | 4.0973 |
| Single-chain antibody Fv fragment (scFv) | mBiP | High | 115 | Medium | 0.22,0.38,0.41 | 12.0768 |
| VHHs B7.3 | ompA | Medium | 1.5 | Medium | 0.06,0.31,0.63 | 4.4261 |
| Glutaminase from | ompA | Medium | 80 | Medium | 0.10,0.25,0.65 | 44.384 |
| elicitin beta-cinnamomin from | pelB | Medium | 13.3 | Medium | 0.09,0.29,0.62 | 7.2501 |
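The predicted expression level in each row above coincides with the largest entry of the classification matrix, read in the order [High, Low, Medium]. The sketch below reproduces that reading for the first three rows of the table; that Periscope assigns the class by taking the maximum is an assumption consistent with these rows, not a documented rule.

```python
from typing import Sequence

CLASS_ORDER = ("High", "Low", "Medium")  # column order given in the table header


def predicted_level(probabilities: Sequence[float]) -> str:
    """Return the class with the largest probability (assumed decision rule)."""
    best = max(range(len(CLASS_ORDER)), key=lambda i: probabilities[i])
    return CLASS_ORDER[best]


# (protein, experimental level, class probabilities) copied from the first three rows.
rows = [
    ("VHHs B5.2", "Medium", (0.09, 0.15, 0.75)),
    ("scFv13.R4", "Low", (0.09, 0.27, 0.64)),
    ("Human protein disulfide isomerase (hPDI)", "Medium", (0.10, 0.18, 0.72)),
]

for protein, experimental, probs in rows:
    predicted = predicted_level(probs)
    print(protein, predicted, "match" if predicted == experimental else "mismatch")
```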