Kulandai Arockia Rajesh Packiam, Chien Wei Ooi, Fuyi Li, Shutao Mei, Beng Ti Tey, Huey Fang Ong, Jiangning Song, Ramakrishnan Nagasundara Ramanan.
Abstract
Optimizing the fermentation process for recombinant protein production (RPP) is often resource-intensive. Machine learning (ML) approaches help to reduce the amount of experimentation required and are widely applied in RPP. However, existing ML-based tools focus primarily on features derived from the amino acid sequence and leave out the influence of fermentation process conditions. The present study combines features derived from fermentation process conditions with those derived from the amino acid sequence to construct an ML-based model that predicts the maximal protein yield, and the corresponding fermentation conditions, for expression of a target recombinant protein in the Escherichia coli periplasm. In the first stage, two sets of XGBoost classifiers classify the expression level of the target protein as high (>50 mg/L), medium (between 0.5 and 50 mg/L), or low (<0.5 mg/L). The second stage consists of three regression models, based on support vector machines and random forest, that predict the expression yield for each expression-level class. Independent tests showed that the predictor achieved an overall average accuracy of 75% and a Pearson correlation coefficient of 0.91 for the correctly classified instances. Our model therefore offers a reliable substitute for numerous trial-and-error experiments in identifying the optimal fermentation conditions and yield for RPP. It is also implemented as an open-access webserver, PERISCOPE-Opt (http://periscope-opt.erc.monash.edu).
Keywords: AUC, area under the curve; CV, cross-validation; CfsSubsetEval, Correlation-based Forward Selection Subset Evaluator; ClassifierSubsetEval, Classifier Subset Evaluator; E. coli, Escherichia coli; Escherichia coli; FC1, Feature Category 1; FC2, Feature Category 2; FC3, Feature Category 3; FC4, Feature Category 4; IPTG, isopropyl β-D-1-thiogalactopyranoside; LOOCV, Leave-one-out cross-validation; MAE, mean absolute error; MCC, Matthews correlation coefficient; ML, machine learning; MLR, machine learning in R; Machine learning; OD, optical density at 600 nm; Optimization; PCC, Pearson correlation coefficient; Periplasmic expression; Prediction model; RF, random forest; RFR, RF regression; RFR-High, RFR for high; RFR-Medium, RFR for medium; RMSE, root mean squared error; RPP, Recombinant protein production; RSM, response surface methodology; Recombinant protein production; SMOTE, Synthetic Minority Over-sampling Technique; SP, signal peptides; SVM, support vector machines; SVR, SVM regression; SVR-Low, SVR for class: "low"; XGB, XGBoost; pI, isoelectric point
Year: 2022 PMID: 35765650 PMCID: PMC9201004 DOI: 10.1016/j.csbj.2022.06.006
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1. Framework of the proposed prediction model. Low: yield is <0.5 mg/L; Medium: yield is between 0.5 and 50 mg/L; High: yield is higher than 50 mg/L. Non-medium refers to High and Low together.
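The two-stage framework in Fig. 1 can be sketched as a small cascade. This is a minimal illustration, not the authors' implementation: it assumes that Classifier 1 separates medium from non-medium and that Classifier 2 then separates high from low, which is one reading of the caption. The names `yield_class`, `predict_two_stage`, `is_medium`, `is_high`, and `regressors` are hypothetical; the callables stand in for trained models.

```python
from typing import Callable, Dict, Sequence, Tuple

LOW_MAX = 0.5    # mg/L; yields below this are "low"
HIGH_MIN = 50.0  # mg/L; yields above this are "high"

Features = Sequence[float]


def yield_class(yield_mg_per_l: float) -> str:
    """Map a measured yield (mg/L) to the class labels used in Fig. 1."""
    if yield_mg_per_l < LOW_MAX:
        return "low"
    if yield_mg_per_l > HIGH_MIN:
        return "high"
    return "medium"


def predict_two_stage(
    x: Features,
    is_medium: Callable[[Features], bool],  # assumed role of Classifier 1
    is_high: Callable[[Features], bool],    # assumed role of Classifier 2
    regressors: Dict[str, Callable[[Features], float]],
) -> Tuple[str, float]:
    """Classify the expression level, then call that class's regressor."""
    if is_medium(x):
        label = "medium"
    else:
        label = "high" if is_high(x) else "low"
    return label, regressors[label](x)
```

In use, `regressors` would hold one fitted model per class (for example the SVR-Low, RFR-Medium, and RFR-High models named in Fig. 3), each wrapped as a function from a feature vector to a yield in mg/L.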
Fig. 2. Feature importance for a) XGB Classifier 1 and b) XGB Classifier 2. Performance of the model has been evaluated using ten repetitions of 10-fold cross-validation (100 experiments).
Fig. 3. Feature importance for a) SVR-Low, b) RFR-Medium, and c) RFR-High. Performance of the model has been evaluated using ten repetitions of 10-fold cross-validation (100 experiments).
Fig. 4. Benchmarking of the performance of different algorithms. a) Classification tasks for both training and testing datasets. b) Regression tasks for both training and testing datasets.
Classification task – Benchmarking with three algorithms.
| Metric | Classifier 1 | | | Classifier 2 | | |
|---|---|---|---|---|---|---|
| Selected number of features | 4 | 8 | 16 | 1 | 4 | 6 |
| Accuracy (%) | 81.36 | 76.45 | 63.35 | – | 77.27 | 68.18 |
| Error rate (%) | 18.63 | 23.55 | 36.65 | – | 22.73 | 31.82 |
| Precision | 0.814 | 0.764 | 0.636 | – | 0.776 | 0.682 |
| Recall | 0.814 | 0.764 | 0.634 | – | 0.773 | 0.682 |
| F-measure | 0.813 | 0.764 | 0.629 | – | 0.773 | 0.682 |
| MCC | 0.626 | 0.527 | 0.267 | – | 0.549 | 0.362 |
| AUC | 0.913 | 0.788 | 0.643 | – | 0.791 | 0.747 |
Performance of the model has been evaluated using leave-one-out cross validation (LOOCV).
A dash (–) marks the configuration in which only one key feature was selected; further model training is neither essential nor meaningful in that case.
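For reference, the metrics in the classification table can be computed from a binary confusion matrix as sketched below. This is an illustrative helper only: the function name `binary_metrics` and the example counts are hypothetical, and the table's values may be averaged across classes, which this sketch does not attempt.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
print(binary_metrics(tp=40, fp=10, fn=10, tn=40))
```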
Regression task – Benchmarking with three algorithms.
| Metric | Regression – Low | | | Regression – Medium | | | Regression – High | | |
|---|---|---|---|---|---|---|---|---|---|
| Selected number of features | 5 | 5 | 12 | 12 | 12 | 14 | 9 | 6 | 10 |
| Pearson Correlation Coefficient (PCC) | 0.7891 | 0.7103 | 0.8288 | 0.8971 | 0.7574 | 0.8623 | 0.8664 | 0.8534 | 0.8137 |
| Mean Absolute Error (MAE) | 0.0738 | 0.2759 | 0.0623 | 3.6673 | 8.8066 | 4.6152 | 47.4097 | 114.973 | 47.0493 |
| Root Mean Squared Error (RMSE) | 0.0944 | 0.2993 | 0.0887 | 5.7796 | 13.5347 | 6.6317 | 76.694 | 163.13 | 90.4669 |
Performance of the model has been evaluated using leave-one-out cross validation (LOOCV).
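Leave-one-out cross-validation, as used for the regression benchmarks, can be sketched as follows. The sketch is generic: `loocv_errors` and `fit_mean` are hypothetical names, the mean predictor is a deliberately trivial stand-in for the SVR and RF regressors, and the three data points are toy values rather than data from the study.

```python
import math
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[float], float]          # (features, target yield)
Model = Callable[[Sequence[float]], float]   # fitted predictor


def loocv_errors(rows: List[Row], fit: Callable[[List[Row]], Model]) -> Tuple[float, float]:
    """Return (MAE, RMSE) from leave-one-out cross-validation."""
    abs_errors, sq_errors = [], []
    for i, (x, y) in enumerate(rows):
        train = rows[:i] + rows[i + 1:]   # hold out exactly one sample
        model = fit(train)                # refit on the remaining samples
        err = model(x) - y
        abs_errors.append(abs(err))
        sq_errors.append(err * err)
    n = len(rows)
    return sum(abs_errors) / n, math.sqrt(sum(sq_errors) / n)


def fit_mean(train: List[Row]) -> Model:
    """Trivial baseline: always predict the mean training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


toy_rows = [([0.0], 1.0), ([1.0], 2.0), ([2.0], 3.0)]
mae, rmse = loocv_errors(toy_rows, fit_mean)
print(mae, rmse)
```

Any learner can be substituted for `fit_mean` as long as it returns a callable that maps a feature vector to a predicted yield.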
Predicted yields at the given experimental conditions.
| No | SP-protein combination | OD | IPTG | Temperature (°C) | Time (h) | Actual yield (mg/L) | Actual class | Predicted yield (mg/L) | Predicted class |
|---|---|---|---|---|---|---|---|---|---|
| 1 | pel-B-eGFP | 0.5 | 0.1 | 18 | 4 | 4.7 | M | 3.0 | M |
| 2 | Cex-eGFP | 0.7 | 0.5 | 27 | 4 | 4.8 | M | 0.3 | *L |
| 3 | ompA-eGFP | 1 | 0.5 | 18 | 4 | 2.8 | M | 2.0 | M |
| 4 | ompC-eGFP | 0.7 | 0.5 | 18 | 4 | 5.7 | M | 2.6 | M |
| 5 | Lpp-eGFP | 0.4 | 0.1 | 18 | 4 | 0.6 | M | 1.6 | M |
| 6 | DmsA-eGFP | 1 | 1 | 18 | 4 | 80.1 | H | 20.7 | *M |
| 7 | MdoD-eGFP | 0.4 | 0.5 | 27 | 4 | 14.2 | M | 6.2 | M |
| 8 | pel-B-TMT | 0.4 | 0.5 | 28 | 4 | 53.2 | H | 27.1 | *M |
| 9 | Cex-TMT | 0.4 | 1 | 38 | 4 | 137.5 | H | 120.8 | H |
| 10 | ompA-TMT | – | – | – | 4 | 0.0 | L | 0.0 | L |
| 11 | ompC-TMT | – | – | – | 4 | 0.0 | L | 0.0 | L |
| 12 | Lpp-TMT | 0.7 | 1 | 18 | 4 | 95.5 | H | 116.5 | H |
| 13 | DmsA-TMT | 0.4 | 0.7 | 28 | 4 | 482.2 | H | 175.3 | H |
| 14 | MdoD-TMT | 0.4 | 1 | 38 | 4 | 24.5 | M | 9.0 | M |
| 15 | pelB-IFN | 4 (TB) | 0.05 | 25 | 14 | 0.4 | L | 0.1 | L |
| 16 | pelB-VEGFR2-D3 | 1 | 1 | 37 | 20 | 2.0 | M | 4.1 | M |
| 17 | pho-rhES | 0.6 | 0.3 | 25 | 13.57 | 2.2 | M | 1.8 | M |
| 18 | modspA-CALB | 1 | 12.5%(L) | 24 | 15 h | 234.0 | H | 126.5 | H |
| 19 | MBP-6 × His-U24 | 0.5–1.0 | 0.3 | 18 | 16 | 2.8 | M | 3.0 | M |
| 20 | Pel-B-SynVNAR-A6 | 0.5 | 0.1 | 18 | 21 | 27.0 | M | 7.5 | M |
| 21 | modBlaasp-hAct A | 0.6 | 1 | 37 | 8 | 150.0 | H | 0.0 | *L |
| 22 | CusF-GFP | 0.5 | 0.1 | 12 | 25 | 8.0 | M | 5.2 | M |
| 23 | ecotin-HArbd | 0.6–0.8 | 0.4–1 | 30 | 8–10 | 10.0 | M | 13.3 | M |
| 24 | mBiP-scFv | 0.5 | 0.2 | 30 | 5 | 115.0 | H | 7.3 | *M |
| 25 | pelB-scFv-dmOKT3 | 0.8 | 0.1 | 22–24 | 18–20 | 0.2 | L | 13.3 | *M |
| 26 | stII-vtPA | 0.5 | 1 | 30 | 6 | 0.2 | L | 0.1 | L |
| 27 | LTIIb-B-CT-B | 0.3 | 0.02 | 37 | 6 | 190.0 | H | 8.3 | *M |
| 28 | pelB-rPA | 0.7 | 1 | 24 | 21 | 0.0 | L | 0.2 | L |
The misclassified instances are represented by an asterisk (*). TB – Terrific broth (medium); L – Lactose (inducer).
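The overall accuracy quoted in the abstract can be recomputed from the class labels in the table above. The sketch below transcribes the yield and class columns for rows 1 to 28 and also computes a Pearson correlation over the correctly classified rows; because it uses the rounded table values, that coefficient may differ slightly from the 0.91 reported in the abstract. The names `ROWS` and `pearson` are illustrative.

```python
import math

# (actual yield mg/L, actual class, predicted yield mg/L, predicted class)
ROWS = [
    (4.7, "M", 3.0, "M"), (4.8, "M", 0.3, "L"), (2.8, "M", 2.0, "M"),
    (5.7, "M", 2.6, "M"), (0.6, "M", 1.6, "M"), (80.1, "H", 20.7, "M"),
    (14.2, "M", 6.2, "M"), (53.2, "H", 27.1, "M"), (137.5, "H", 120.8, "H"),
    (0.0, "L", 0.0, "L"), (0.0, "L", 0.0, "L"), (95.5, "H", 116.5, "H"),
    (482.2, "H", 175.3, "H"), (24.5, "M", 9.0, "M"), (0.4, "L", 0.1, "L"),
    (2.0, "M", 4.1, "M"), (2.2, "M", 1.8, "M"), (234.0, "H", 126.5, "H"),
    (2.8, "M", 3.0, "M"), (27.0, "M", 7.5, "M"), (150.0, "H", 0.0, "L"),
    (8.0, "M", 5.2, "M"), (10.0, "M", 13.3, "M"), (115.0, "H", 7.3, "M"),
    (0.2, "L", 13.3, "M"), (0.2, "L", 0.1, "L"), (190.0, "H", 8.3, "M"),
    (0.0, "L", 0.2, "L"),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Rows whose predicted class matches the actual class.
correct = [r for r in ROWS if r[1] == r[3]]
accuracy = len(correct) / len(ROWS)
pcc_correct = pearson([r[0] for r in correct], [r[2] for r in correct])
print(f"accuracy = {accuracy:.2f} ({len(correct)}/{len(ROWS)})")
print(f"PCC over correctly classified rows = {pcc_correct:.2f}")
```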
Predicted maximal yields and the corresponding fermentation conditions.
| No | SP-protein combination | Experimental OD | Experimental IPTG | Experimental temperature (°C) | Experimental time (h) | Experimental yield (mg/L) | Experimental class | Predicted OD | Predicted IPTG | Predicted temperature (°C) | Predicted time (h) | Predicted yield (mg/L) | Predicted class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | pel-B-eGFP | 0.5 | 0.1 | 18 | 4 | 4.7 | M | 1 | 0.1 | 20 | 4 | 3.8 | M |
| 2 | Cex-eGFP | 0.7 | 0.5 | 27 | 4 | 4.8 | M | 0.7 | 0.5 | 25 | 16 | 4.6 | M |
| 3 | ompA-eGFP | 1 | 0.5 | 18 | 4 | 2.8 | M | 0.7 | 1 | 25 | 24 | 1.9 | M |
| 4 | ompC-eGFP | 0.7 | 0.5 | 18 | 4 | 5.7 | M | 0.7 | 0.5 | 25 | 4 | 4.3 | M |
| 5 | Lpp-eGFP | 0.4 | 0.1 | 18 | 4 | 0.6 | M | 0.7 | 0.5 | 30 | 24 | 1.9 | M |
| 6 | DmsA-eGFP | 1 | 1 | 18 | 4 | 80.1 | H | 1 | 0.5 | 30 | 4 | 38.4 | *M |
| 7 | MdoD-eGFP | 0.4 | 0.5 | 27 | 4 | 14.2 | M | 0.4 | 0.5 | 30 | 8 | 8.1 | M |
| 8 | pel-B-TMT | 0.4 | 0.5 | 28 | 4 | 53.2 | H | 1 | 0.5 | 30 | 4 | 31.9 | *M |
| 9 | Cex-TMT | 0.4 | 1 | 38 | 4 | 137.5 | H | 0.4 | 1 | 37 | 24 | 141.5 | H |
| 10 | ompA-TMT | – | – | – | 4 | 0.0 | L | 1 | 1 | 30 | 24 | 0.1 | L |
| 11 | ompC-TMT | – | – | – | 4 | 0.0 | L | 0.4 | 1 | 20 | 4 | 26.0 | *M |
| 12 | Lpp-TMT | 0.7 | 1 | 18 | 4 | 95.5 | H | 0.4 | 1 | 37 | 24 | 151.6 | H |
| 13 | DmsA-TMT | 0.4 | 0.7 | 28 | 4 | 482.2 | H | 0.4 | 0.5 | 30 | 8 | 268.3 | H |
| 14 | MdoD-TMT | 0.4 | 1 | 38 | 4 | 24.5 | M | 0.4 | 1 | 37 | 8 | 16.3 | M |
The misclassified instances are represented by an asterisk symbol (*).
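One way to obtain a predicted maximal yield and its conditions is an exhaustive search over a grid of candidate process conditions, sketched below. This is an assumption about the approach rather than a description of the published method. The candidate levels are the values that appear in the predicted-condition columns of the table above; whether exactly this grid was searched is not stated here. `toy_predictor` is a hypothetical stand-in for the trained two-stage model, and `best_conditions` is an illustrative name.

```python
from itertools import product

# Candidate levels taken from the predicted-condition columns of the table.
GRID = {
    "od": [0.4, 0.7, 1.0],
    "iptg": [0.1, 0.5, 1.0],
    "temp_c": [20, 25, 30, 37],
    "time_h": [4, 8, 16, 24],
}


def best_conditions(predict_yield, grid=GRID):
    """Return (conditions, yield) for the grid point with the highest prediction."""
    names = list(grid)
    best, best_yield = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        conditions = dict(zip(names, values))
        predicted = predict_yield(**conditions)
        if predicted > best_yield:
            best, best_yield = conditions, predicted
    return best, best_yield


def toy_predictor(od, iptg, temp_c, time_h):
    """Hypothetical yield surface with a single peak; NOT the published model."""
    return (
        100.0
        - (temp_c - 30) ** 2
        - 10 * abs(od - 0.7)
        - 5 * abs(iptg - 0.5)
        - 0.1 * abs(time_h - 8)
    )


conditions, predicted = best_conditions(toy_predictor)
print(conditions, predicted)
```

With 3 × 3 × 4 × 4 = 144 grid points, the exhaustive loop stays cheap even when each prediction calls a classifier and a regressor.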
Selected features for prediction model.
| Feature category | Classification: XGB Classifier 1 | Classification: XGB Classifier 2 | Regression: SVR-Low | Regression: RFR-Medium | Regression: RFR-High |
|---|---|---|---|---|---|
| FC1 | Occ_E | Seq_len | OF_E | OF_F | OF_DmE |
| | MNC_A | Occ_V | OF_S | OF_M | |
| | OF_MNC_C | pI_Protpi | OF_Aliphatic | OF_MNC_A | |
| | OF_Aromatic | Helix_to_Sheet_PHD | OF_Hphil_ESG | OF_Aromatic | |
| | Expno_AA_TM | | OF_Hphil_KD | OF_Hphil_ESG | |
| | | | Coil_PHD | Pred_Sol | |
| FC2 | – | – | – | – | – |
| FC3 | OF_Hphil_ESG×Sheet_PHD | – | – | – | Occ_N×MNC_Y |
| | | | | | Occ_Y×OF_DmE |
| FC4 | Temperature | – | OD | Temperature | IPTG |
| | OD×Time | | Temperature | OD×IPTG | Temperature |
| | | | OD×Temp | OD×Temp | OD×Temp |
| | | | OD×Time | OD×Time | OD×Time |
| | | | IPTG×Temp | IPTG×Temp | IPTG×Temp |
| | | | Temp×Time | Temp×Time | Temp×Time |
| Selected features | 8 | 4 | 12 | 12 | 9 |
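Feature names such as OD×Temp suggest pairwise interaction terms between process conditions. The sketch below builds such terms under the assumption that "×" denotes a plain product of the two raw values; the table does not confirm this, and any scaling applied before multiplication is unknown. The function name `interaction_features` is hypothetical, and the example values are the process conditions of row 1 (pel-B-eGFP) in the table of predicted yields.

```python
from itertools import combinations


def interaction_features(conditions: dict) -> dict:
    """Return the raw conditions plus all pairwise products, keyed as 'A×B'."""
    features = dict(conditions)
    for (name_a, a), (name_b, b) in combinations(conditions.items(), 2):
        # Assumption: an interaction feature is the product of its two parts.
        features[f"{name_a}×{name_b}"] = a * b
    return features


# Process conditions for pel-B-eGFP: OD, IPTG, temperature, induction time.
row1 = {"OD": 0.5, "IPTG": 0.1, "Temp": 18, "Time": 4}
feats = interaction_features(row1)
for name, value in feats.items():
    print(name, value)
```

The sketch generates all six pairs, including IPTG×Time, which does not appear among the selected features above; a feature-selection step would be expected to prune such candidates.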