
PERISCOPE-Opt: Machine learning-based prediction of optimal fermentation conditions and yields of recombinant periplasmic protein expressed in Escherichia coli.

Kulandai Arockia Rajesh Packiam¹, Chien Wei Ooi^1,2, Fuyi Li³, Shutao Mei⁴, Beng Ti Tey^1,2, Huey Fang Ong⁵, Jiangning Song^4,6, Ramakrishnan Nagasundara Ramanan¹.

Abstract

Optimization of the fermentation process for recombinant protein production (RPP) is often resource-intensive. Machine learning (ML) approaches help to minimize the number of experiments and are widely applied in RPP. However, existing ML-based tools focus primarily on features derived from the amino acid sequence and neglect the influence of fermentation process conditions. The present study combines features derived from fermentation process conditions with those derived from the amino acid sequence to construct an ML-based model that predicts the maximal protein yield and the corresponding fermentation conditions for the expression of a target recombinant protein in the Escherichia coli periplasm. Two sets of XGBoost classifiers were employed in the first stage to classify the expression level of the target protein as high (>50 mg/L), medium (between 0.5 and 50 mg/L), or low (<0.5 mg/L). The second stage consisted of three regression models based on support vector machines and random forest to predict the expression yield for each expression-level class. Independent tests showed that the predictor achieved an overall average accuracy of 75% and a Pearson correlation coefficient of 0.91 for the correctly classified instances. Therefore, our model offers a reliable substitute for numerous trial-and-error experiments in identifying the optimal fermentation conditions and yield for RPP. It is also implemented as an open-access webserver, PERISCOPE-Opt (http://periscope-opt.erc.monash.edu).


Keywords: AUC, area under the curve; CV, cross-validation; CfsSubsetEval, Correlation-based Forward Selection Subset Evaluator; ClassifierSubsetEval, Classifier Subset Evaluator; E. coli, Escherichia coli; Escherichia coli; FC1, Feature Category 1; FC2, Feature Category 2; FC3, Feature Category 3; FC4, Feature Category 4; IPTG, isopropyl β-D-1-thiogalactopyranoside; LOOCV, Leave-one-out cross-validation; MAE, mean absolute error; MCC, Matthews correlation coefficient; ML, machine learning; MLR, machine learning in R; Machine learning; OD, optical density at 600 nm; Optimization; PCC, Pearson correlation coefficient; Periplasmic expression; Prediction model; RF, random forest; RFR, RF regression; RFR-High, RFR for high; RFR-Medium, RFR for medium; RMSE, root mean squared error; RPP, Recombinant protein production; RSM, response surface methodology; Recombinant protein production; SMOTE, Synthetic Minority Over-sampling Technique; SP, signal peptides; SVM, support vector machines; SVR, SVM regression; SVR-Low, SVR for class: "low"; XGB, XGBoost; pI, isoelectric point

Year: 2022 PMID: 35765650 PMCID: PMC9201004 DOI: 10.1016/j.csbj.2022.06.006

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155

Introduction

Recombinant protein production (RPP) is a noteworthy biotechnological technique that finds rising applications in various sectors such as healthcare, detergents, food industry, and, most importantly, in research and development [1], [2]. Escherichia coli (E. coli) is considered an ideal host for RPP because it offers numerous advantages, including simple nutritional requirements, faster cellular growth, and ease of achieving high cell densities [3], [4]. During RPP, the protein of interest can be directed towards the periplasmic space of E. coli by a short amino acid sequence called a signal peptide (SP). The periplasmic expression of recombinant proteins is preferred because the periplasmic space provides an oxidized environment that improves protein folding, specifically for proteins containing disulphide bonds. Moreover, the target proteins can be selectively recovered from the periplasmic space using milder cell disruption steps that avoid the release of cytoplasmic content into the processing fluid [5], [6], [7], [8]. The increasing demand for recombinant proteins has driven the need to optimize various fermentation process parameters to achieve the maximal RPP. Despite extensive work devoted to RPP and technological advancement, achieving high yields of recombinant proteins remains a challenge. Moreover, optimization of RPP involves tedious, costly, and time-consuming experiments to identify the optimal fermentation conditions, which are specific to each type of protein [3]. Machine learning (ML) techniques have emerged as a game-changer in many areas of research, including the field of biotechnology. Biotechnological processes such as RPP can be deciphered using ML-based prediction tools, which are developed using the available data to provide a rough estimate of the unknown biological responses. 
For instance, there are several notable ML-based prediction tools for various RPP-based applications, including the prediction of protein solubility, protein folding rates, and protein expression yields. PROSO II [9], ccSOL Omics [10], Protein–Sol [11], DeepSol [12], PaRSnIP [13], SoDoPE [14] and SolTranNet [15] are among the ML-based tools that predict the protein solubility with high accuracies. Similarly, the well-known ML-based prediction tools for determining the protein folding rates are K-Fold [16], Pred-PFR [17], PRORATE [18], and SeqRate [19]. Additionally, ESPRESSO [20] and Periscope [21] are the advanced ML tools that can predict the protein expression yields in the cytoplasm and periplasm of E. coli, respectively; both tools incorporate the concepts of previously developed models used in the determination of protein solubility and folding rate. The above-mentioned ML-based tools were developed mainly by using the key features associated with the amino acid sequences. This is because the amino acid sequence primarily influences the protein solubility and folding rates, which in turn affects many facets of RPP [3], [21]. A limited number of studies have also considered features from other aspects, such as gene sequences [22] and host strains and/or vectors [23]. Nevertheless, the ML models based on amino-acid sequence do not necessarily provide a clear notion of the optimized condition during RPP because the models exclude features related to the fermentation process conditions, which play vital roles in increasing the yields of RPP. The present study aimed to derive a global ML-based model capable of predicting the optimal protein yield and fermentation process conditions for a target recombinant protein to be expressed in the periplasmic space of E. coli. Two sets of XGBoost (XGB) classifiers were employed in the first stage to classify the target protein into high (>50 mg/L), medium (between 0.5 and 50 mg/L) or low (<0.5 mg/L) expression levels. 
In accordance with the classified level of protein expression, the predictions of optimal fermentation conditions and yields were then obtained using three sets of regression models based on support vector machine (SVM) or random forest (RF). This ensemble model was developed by integrating data from an existing bioinformatics tool, Periscope, with data from the literature and in-house experiments; in total, 84 protein types and 103 SP-protein combinations were used in this study. A total of 11,985 features were initially extracted by considering all important factors associated with the amino acid sequence and the fermentation process conditions. Key feature subsets were then extracted using a stepwise feature selection method. The resultant robust model gives a good estimate of the maximal amount of recombinant protein and the fermentation conditions responsible for the optimal protein expression. Ultimately, our prediction tool could minimize the time spent on trial-and-error experiments to attain high yields of recombinant proteins.

Methods

1. Generation of datasets

The data used in the present study include the amino acid sequence of the recombinant protein expressed in the periplasm of E. coli, the corresponding protein expression yield measured in milligrams per litre, and the parameters of the fermentation process conditions. The data were collected from: (i) an existing prediction model, namely Periscope [21]; (ii) a literature search using popular search engines such as Scopus, Google Scholar, and PubMed; and (iii) our in-house experimental findings (Tables S1 and S2). The literature data were extracted from research articles fulfilling the following criteria: (i) an E. coli strain as the host and lac promoters for expression; (ii) heterologous protein expression in the periplasm; (iii) an SP at the N-terminus; (iv) batch fermentation at shake flask scale; and (v) neither genetic modification of the host strain nor the use of co-expression vectors. On the whole, the present study comprised 461 datasets (collections of data) for 84 proteins and 103 SP-protein combinations. The sequence redundancy of the 103 SP-protein combinations was removed using the CD-HIT suite [24] at a sequence similarity threshold of 90%. Soluble protein expression level and protein expression yield were chosen as the response variables for the classification and regression tasks, respectively. The independent test datasets involved (i) data from ten proteins whose amino acid sequences are completely unseen by the model, and (ii) data from eighteen proteins whose amino acid sequences are known but whose optimal fermentation conditions are unseen by the model. The latter data correspond to the optimal fermentation conditions reported in studies dealing with the statistical optimization of recombinant proteins based on response surface methodology (RSM). 
The independent test datasets allow the verification of the global optimization scheme as well as the validation of the prediction performance of the model on the unknown amino acid sequences. The test datasets were further split manually to ensure the equal distributions of high, low, and medium instances as well as different sizes of proteins in the case of unseen amino acid sequences. All the data used in the current study are available as the supplementary information in this article.

Feature extraction

A total of 11,985 initial features were extracted and further classified into four major categories of feature. Since RPP is primarily regulated by the amino acid sequence of a protein, we focused on the features that can be extracted directly from the amino acid sequences, as well as features like physico-chemical and structural properties derived indirectly using these amino acid sequences. Amongst these features, Feature Category 1 (FC1) constitutes general features such as the length of the protein, occurrences of each type of the 20 amino acid residues, and the maximum number of consecutive identical amino acid residues. FC1 also included other features such as the occurrences and the maximum number of consecutive amino acid residues with similar physico-chemical properties. Lastly, the structural features like molecular weight, isoelectric point (pI), net charge, solubility, protein folding rate, and helix/sheet propensity were also added to FC1 (refer to Tables S3-1 to S3-3 for details). Because of the high dimensionality, the occurrences of each of the dipeptides were separately grouped into Feature Category 2 (FC2), as shown in Table S3-4. There is a possibility that the influence of each feature calculated from the amino acid sequences not only arises from the occurrences of respective residues but also due to the occurrence frequencies for a given protein length (i.e., the numbers of amino acid residues). To examine this possibility, we considered both occurrences as well as occurrence frequencies (standardized by the length of the protein) as two separate features. With these additional features, FC1 and FC2 consist of 149 and 800 features, respectively. All the interactions between each of the two features from FC1 resulted in the derivation of 11,026 interactive features classified under Feature Category 3 (FC3). Finally, Feature Category 4 (FC4) encompasses 10 features extracted from the fermentation process conditions. 
The features in FC4 include cell density (measured as the optical density at 600 nm), inducer concentration upon induction of protein expression, post-induction temperature and time, and all six pairwise interactions between these features (Table S3-5). To avoid potential biases arising from the different scales of the data, the data were normalized using Equation (1). All features except those in FC3 and the interactive features from FC4 were normalized in this way; the excepted features were instead calculated from the normalized FC1 and FC4 datasets, respectively.

$$x_i' = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \qquad (1)$$

where $x_i'$ and $x_i$ are the normalized and actual values, respectively, of the feature for the i-th protein, while $x_{\min}$ and $x_{\max}$ are the minimum and maximum actual values of the feature amongst all the n proteins. Cell density differs with respect to the type of fermentation media used, and accordingly, the cell densities resulting from different types of fermentation media were normalized separately. Similarly, the lac promoters could be induced by either isopropyl β-D-1-thiogalactopyranoside (IPTG) or lactose, and therefore, each type of inducer was normalized individually. As no direct relation has been reported between the different media or between the different inducers, normalizations were performed using Equation (1) separately for the data corresponding to each medium and inducer type, and the normalized values were used directly for further study. The predictions of the optimized fermentation conditions assume the most commonly used fermentation medium (Luria Bertani broth) and inducer (IPTG) as the default parameters.
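The feature bookkeeping and the per-group normalization described above can be illustrated with a minimal Python sketch. It is not the authors' code (the study used Weka and R); the helper names `min_max` and `min_max_by_group`, the example cell densities, and the choice to map a constant group to 0.0 are all assumptions made for illustration. The category sizes are the ones reported in the text, and the check confirms that all pairwise interactions of the 149 FC1 features give 11,026 FC3 features and 11,985 features overall.

```python
from collections import defaultdict
from itertools import combinations

# Feature-category sizes reported in the text.
N_FC1, N_FC2, N_FC4 = 149, 800, 10

# FC3: one interaction feature per unordered pair of FC1 features.
n_fc3 = sum(1 for _ in combinations(range(N_FC1), 2))
total_features = N_FC1 + N_FC2 + n_fc3 + N_FC4


def min_max(values):
    """Scale numbers to [0, 1] as in Equation (1).

    Assumption: a group with identical values maps to 0.0 to avoid
    division by zero; the paper does not describe this case.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def min_max_by_group(records, value_key, group_key):
    """Normalize records[value_key] separately within each group."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec[group_key]].append(idx)
    scaled = [None] * len(records)
    for idxs in groups.values():
        group_values = [records[i][value_key] for i in idxs]
        for idx, s in zip(idxs, min_max(group_values)):
            scaled[idx] = s
    return scaled


# Hypothetical cell densities (OD600) measured in two different media.
runs = [
    {"medium": "LB", "od": 0.4},
    {"medium": "LB", "od": 1.0},
    {"medium": "TB", "od": 2.0},
    {"medium": "TB", "od": 6.0},
    {"medium": "TB", "od": 4.0},
]
normalized_od = min_max_by_group(runs, "od", "medium")
print(n_fc3, total_features, normalized_od)
```

Normalizing within each medium means that an OD of 1.0 in one medium and 6.0 in another can both map to 1.0, which matches the stated intent of treating media (and, analogously, inducers) as separate scales.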

Software packages and algorithms

The open-source software packages Weka 3.8 [25] and R [26] were used in this study because they offer a wide range of algorithms for data preprocessing, feature selection, and ML tasks. Feature selection and benchmarking experiments were performed using Weka, while the models were trained and evaluated using R with the machine learning in R (MLR) package [27]. A filter-based method, Correlation-based Forward Selection Subset Evaluator (CfsSubsetEval), was used in the initial selection of features. CfsSubsetEval favours feature subsets that are highly correlated with the response variable but only weakly correlated with one another. Additionally, a wrapper-based method known as Classifier Subset Evaluator (ClassifierSubsetEval) was used in the final step of feature selection, where the merit of each feature subset was obtained by evaluating it in conjunction with the classification or regression algorithm; although time-consuming, ClassifierSubsetEval can generate a reliable selection outcome. Three different algorithms, namely SVM, XGB, and RF, were used in both classification and regression tasks because they have been widely employed in bioinformatics for protein-based prediction tasks. These three algorithms were extensively tested and evaluated in the stages of feature selection, model training, and independent testing. Eventually, the most appropriate algorithm was selected for each specific classification and regression task.

Feature selection

In this study, the important features were selected for both the classification and the regression tasks using a stepwise feature selection strategy:
(i) Features were selected from both FC2 and FC3 using the CfsSubsetEval method with the Best First search method.
(ii) The number of selected FC3 features was further reduced using the CfsSubsetEval and Greedy Stepwise methods. The ‘Generate ranking’ option was set to true, and the number of features to be retained was fixed at ten.
(iii) Finally, the key features were selected from all the features in FC1 and FC4 together with the previously selected features from FC2 (Step i) and FC3 (Step ii) using ClassifierSubsetEval.
Each of the three classification and regression algorithms used in conjunction with ClassifierSubsetEval selected different key features, in varying numbers. The final choice of features and of training algorithm took into account the performance of the trained model together with the number and nature of the selected features. Furthermore, the relative importance of the selected features was evaluated according to a previously reported strategy [21], which is briefly described here: each of the selected features was removed one at a time until every selected feature had been removed once; the models were then retrained using the best-performing algorithms, and the changes in the performance measures were computed and compared.
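The feature-importance procedure described at the end of this section can be sketched as a leave-one-feature-out ablation. The Python snippet below is only an illustration under stated assumptions: `leave_one_feature_out` and `toy_evaluate` are hypothetical names, the scoring function is a made-up stand-in for "retrain the model and measure its performance", and the per-feature contributions are invented numbers rather than results from the study.

```python
def leave_one_feature_out(features, evaluate):
    """Return {feature: baseline_score - score_without_that_feature}.

    `evaluate` is any callable that trains and scores a model on a
    feature list (higher is better). A large positive drop suggests
    that the removed feature is important.
    """
    baseline = evaluate(list(features))
    drops = {}
    for name in features:
        reduced = [f for f in features if f != name]
        drops[name] = baseline - evaluate(reduced)
    return drops


# Hypothetical stand-in for "retrain and return accuracy":
# each feature adds a fixed, invented amount to the score.
TOY_CONTRIBUTION = {"Temperature": 0.05, "OF_Aromatic": 0.04, "OD_x_Time": 0.01}


def toy_evaluate(feature_subset):
    return 0.70 + sum(TOY_CONTRIBUTION[f] for f in feature_subset)


drops = leave_one_feature_out(list(TOY_CONTRIBUTION), toy_evaluate)
ranking = sorted(drops, key=drops.get, reverse=True)
print(ranking)
```

In practice `evaluate` would wrap the full training and cross-validation loop, so the cost grows linearly with the number of selected features.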

Training and evaluation of model

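The three ML algorithms, SVM, XGB, and RF, were employed in both classification and regression tasks. Fine-tuning of the hyper-parameters was not performed because the default values of these parameters in the MLR package gave better results. Based on the performance of the algorithms on the training datasets, the best-performing algorithm was chosen for further development of the model. Several widely used performance metrics, including accuracy, error rate, precision, recall, F-measure, Matthews correlation coefficient (MCC), and area under the curve (AUC), were used in the performance evaluation of the classification models [see Equations (2)–(7)]. Similarly, Pearson correlation coefficient (PCC), mean absolute error (MAE), and root mean squared error (RMSE) were calculated and used for the assessment of the regression models [Equations (8)–(10)] [21]:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$

$$\text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN} \qquad (3)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (4)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (5)$$

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (6)$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (7)$$

$$\text{PCC} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \qquad (8)$$

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \qquad (9)$$

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (10)$$

where $TP$, $TN$, $FP$, and $FN$ represent the numbers of true positives, true negatives, false positives, and false negatives, respectively; $y_i$ is the actual value of the protein expression yield; $\hat{y}_i$ is the predicted protein expression yield; $\bar{y}$ and $\bar{\hat{y}}$ are the corresponding mean values; and $n$ is the number of instances used in the prediction. The predictive performance of the models was assessed using an internal cross-validation (CV) test, in which the whole dataset was split into a training dataset and an internal testing dataset according to the CV method used. Leave-one-out cross-validation (LOOCV) was chosen as the CV method for model assessment owing to the small number of datasets used in both classification and regression tasks. In LOOCV, the model was trained on all but one dataset, which was held out for internal testing, and the whole training process was repeated until every dataset had been internally tested. The performance metrics were averaged over all the cases during LOOCV. 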
Subsequently, the model was validated using a set of unseen instances, i.e., independent test dataset, and the corresponding performance of the model was correlated to the prediction ability of the developed model.
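For readers who want to reproduce the evaluation measures, the following Python sketch implements the standard definitions of the metrics named above together with a leave-one-out index generator. It is an illustration, not the authors' R/MLR code; the function names are hypothetical and the confusion-matrix counts are invented. The regression example uses only the three actual/predicted yields visible in the first rows of Table 4, so its values are not the performance figures reported in the paper.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Binary-classification metrics from confusion-matrix counts.

    Note: no guard against zero denominators; this sketch assumes
    every count combination used below is non-degenerate.
    """
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }


def regression_metrics(actual, predicted):
    """Pearson correlation, mean absolute error, root mean squared error."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "pcc": cov / math.sqrt(var_a * var_p),
        "mae": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
        "rmse": math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n),
    }


def loocv_splits(n):
    """Yield (training_indices, held_out_index) for leave-one-out CV."""
    for held_out in range(n):
        yield [i for i in range(n) if i != held_out], held_out


# Invented confusion-matrix counts, for illustration only.
cls = classification_metrics(tp=8, tn=9, fp=2, fn=1)

# Actual and predicted yields (mg/L) from the first three rows of Table 4.
reg = regression_metrics([4.7, 4.8, 2.8], [3.0, 0.3, 2.0])

splits = list(loocv_splits(4))
print(cls["accuracy"], reg["mae"], len(splits))
```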

Webserver implementation

The prediction model has been implemented as an online webserver, PERISCOPE-Opt (https://periscope-opt.erc.monash.edu/), to provide users with easy access to the model and its predictions. Based on the amino acid sequences of the SP and protein provided as inputs, the proposed model predicts the optimal fermentation conditions corresponding to the maximal yield of recombinant periplasmic protein. The Reactjs framework was used for the web implementation, which processes the user-defined input data and returns the outcomes of the model predictions. To eliminate the potential impact of service disruptions in external web tools linked to our model, PERISCOPE-Opt requires users to manually key in the data generated by the relevant external web tools. A few web tools that retrieve the necessary protein information are suggested and can be accessed via the web interface.

Results

Computational framework of the proposed optimization model

The proposed prediction model (Fig. 1) was constructed as a two-stage architecture with the following components: (i) two sets of XGB classifiers in the first stage to classify the expression level of the given protein as low, medium, or high, and (ii) three sets of regression models trained using SVM or RF to quantify the protein yield for each class in the second stage. In training the classification model, we adopted two strategies to address the data imbalance arising from the unequal distribution of the three classes. Firstly, two binary classifiers were used such that the first classifier separated the majority class (medium) from the two minority classes taken together (high and low), and the second classifier then separated the two minority classes from each other. Secondly, we applied the Synthetic Minority Over-sampling Technique (SMOTE) to generate additional synthetic data in the minority classes during both classification tasks (Table S4). Therefore, in the first stage, XGB–Classifier 1 categorized the input amino acid sequence into either the class of medium-level expression or the class of non-medium-level expression. If the input fell into the class of non-medium-level expression, XGB–Classifier 2 further classified it as low-level or high-level expression. Based on the class predicted in the first stage, one of the three regression models, i.e., SVM regression (SVR) for low expression data (“SVR-Low”) or RF regression (RFR) for medium (“RFR-Medium”) or high (“RFR-High”) expression data, was employed in the second stage to predict the expression yield. 
The proposed model was further employed in the computation of expression levels and yields for 180 combinations consisting of various levels of process features [namely, optical density at 600 nm (OD) (0.4, 0.7, 1.0), IPTG (0.1, 0.5, 1.0 mM) as inducer, temperature (20, 25, 30, 37 °C) and time (4, 8, 12, 16, 24 h)] for the given input of amino acid sequence using R programming. Then, it classified each of the combinations into the respective classes and quantified the corresponding expression yields. Based on the resulting expression yields, we are able to arrive at the top ten values for the optimal periplasmic expression yields of recombinant proteins and the respective fermentation conditions.
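The two-stage routing and the search over the 180 process-condition combinations can be sketched as follows. This Python snippet is an illustration only: the factor levels are those listed in the text, but `predict_two_stage`, `top_conditions`, and all of the `toy_*` models are hypothetical stand-ins for the trained XGB classifiers and SVR/RFR regressors, with made-up decision rules and yields. The real models also take sequence-derived features, which are omitted here.

```python
from itertools import product

# Process-feature levels listed in the text.
OD_LEVELS = (0.4, 0.7, 1.0)
IPTG_LEVELS_MM = (0.1, 0.5, 1.0)
TEMP_LEVELS_C = (20, 25, 30, 37)
TIME_LEVELS_H = (4, 8, 12, 16, 24)

# Every (OD, IPTG, temperature, time) combination: 3 * 3 * 4 * 5 = 180.
GRID = list(product(OD_LEVELS, IPTG_LEVELS_MM, TEMP_LEVELS_C, TIME_LEVELS_H))


def predict_two_stage(condition, is_medium, is_high, regressors):
    """Stage 1 routes to a class; stage 2 applies that class's regressor."""
    if is_medium(condition):          # Classifier 1: medium vs non-medium
        level = "medium"
    elif is_high(condition):          # Classifier 2: high vs low
        level = "high"
    else:
        level = "low"
    return level, regressors[level](condition)


def top_conditions(grid, is_medium, is_high, regressors, k=10):
    """Score every condition and return the k highest predicted yields."""
    scored = []
    for cond in grid:
        level, yield_mg_l = predict_two_stage(cond, is_medium, is_high, regressors)
        scored.append((yield_mg_l, level, cond))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical toy models; condition = (od, iptg_mM, temp_C, time_h).
def toy_is_medium(cond):
    return cond[2] >= 30


def toy_is_high(cond):
    return cond[3] >= 12


toy_regressors = {
    "low": lambda cond: 0.1,
    "medium": lambda cond: 5.0 + cond[0],
    "high": lambda cond: 60.0 + cond[3] - cond[1],
}

best = top_conditions(GRID, toy_is_medium, toy_is_high, toy_regressors)
print(len(GRID), best[0])
```

Because the grid is small, exhaustive enumeration is enough here; no optimizer is needed to rank the candidate conditions.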

Fig. 1

Framework of the proposed prediction model. Low: yield is < 0.5 mg/L, Medium: yield is between 0.5 and 50 mg/L, High: yield is higher than 50 mg/L. Non-medium refers to both High and Low together.

Features selected for model construction

Our preliminary analysis indicated that conventional feature selection using various WEKA-based algorithms extracted significant features mostly from FC2 and FC3, with very few or no features from FC1 and FC4 (results not shown). This outcome is probably due to the much larger numbers of features in FC2 (800) and FC3 (11,026) than in FC1 (149) and FC4 (10), which makes FC2 and FC3 features more likely to be selected. Therefore, to reduce the bias caused by the high dimensionality of the features, we applied a stepwise feature selection strategy to both the classification and the regression tasks. The initial steps of feature selection adopted a filter-based algorithm, CfsSubsetEval, to identify the main features from FC2 and FC3. The important FC2 and FC3 features were then combined with all the features from FC1 and FC4, and a wrapper-based method, ClassifierSubsetEval, was used to select the key features. The features selected for both classification and regression tasks consist mostly of features from FC1 and FC4, with a few from FC3 and none from FC2 (Tables 1 and S5). The features selected on the basis of the amino acid sequences are the occurrences and occurrence frequencies (i.e., occurrences per unit length of protein) of amino acids such as glutamic acid (Occ_E, OF_E), valine (Occ_V), serine (OF_S), phenylalanine (OF_F), and methionine (OF_M), and those of the maximum consecutive alanine (MNC_A, OF_MNC_A) and cysteine (OF_MNC_C). Apart from that, the key features based on the physico-chemical properties are the occurrence frequencies of aromatic (OF_Aromatic), aliphatic (OF_Aliphatic), and hydrophilic residues (OF_Hphil_ESG and OF_Hphil_KD, calculated using the ESG and KD methods). 
Other key features with respect to the structural properties that were identified to be vital for either classification or regression tasks are expected number of amino acids in transmembrane helices (Expno_AA_TM), the ratio of the helix to sheet propensity (Helix_to_Sheet_PHD), coil propensity (Coil_PHD) and solubility score (Pred_Sol). For the features related to fermentation process condition, temperature and OD×Time are the two most significant features to be considered in the classification task, while almost all the other process-condition features were significant in the regression task.

Assessment of feature importance

To further assess the relative importance of the selected features for the two classification and three regression tasks, we trained the models by removing one feature at a time until all had been considered. The respective changes in predictive performance were measured by benchmarking these models against the model with all the selected key features for each of the tasks (Fig. 2, Fig. 3). OF_Aromatic, OF_MNC_C, and temperature were found to be the most important features for XGB–Classifier 1, as the removal of these features decreased MCC and accuracy by 8–13% and 3–5%, respectively. Similarly, the removal of Occ_V or Helix_to_Sheet_PHD drastically affected the performance of XGB–Classifier 2, as seen from the steep decreases in accuracy by 5–16% and in MCC by 4–15%. Seq_len can also be considered an important feature because its elimination impacted XGB–Classifier 2 negatively, decreasing the recall, F-measure, and MCC by 3–4%. For the regression task, the feature "temperature" was found to be crucial because its removal increased the MAEs and RMSEs by 5–10% and 6–9%, respectively. Additionally, in the cases of SVR-Low and RFR-High, a slight decrease (0.5–1.5%) in PCC was observed after the elimination of the features "OD" and "IPTG". Strikingly, OD×Temp, OD×Time, IPTG×Temp, and Temp×Time are the process-interaction features deemed to be significant for the regression models "RFR-Medium" and "RFR-High". The impact of these process-interaction features on the regression tasks is substantiated by removing each of them during model training, which resulted in a 0.1–4% decrease in PCC and a 0.1–9% increase in MAEs or RMSEs. Similarly, the absence of features such as OF_Hphil_ESG and OF_DmE resulted in decreases in PCC and elevated MAEs and RMSEs. 
Contrastingly, the interactions between process features played a minor role for SVR-Low, suggesting that these process interactions, when controlled by fine-tuning process parameters, can yield a higher level of recombinant protein expression. Other than these pertinent features, FC3 interactive features "Occ_N×MNC_Y" and "Occ_Y×OF_DmE" showed a major decline (17%) in PCC when they were removed from the training datasets. In addition, other features such as Occ_E, OF_Hphil_ESG×Sheet_PHD, Coil_PHD, OF_Aliphatic, OD, OF_Aromatic, Pred_Sol, MNC_A, Expno_AA_TM, and OD×Time had minor effects on the predictive performance after their removal during model training; nonetheless, we retained these features because they improved the performance.

Fig. 2

Feature importance for a) XGB Classifier 1 b) XGB Classifier 2. Performance of the model has been evaluated using ten times 10-fold cross validation (100 experiments).

Fig. 3

Feature importance for a) SVR-Low b) RFR-Medium c) RFR-High. Performance of the model has been evaluated using ten times 10-fold cross validation (100 experiments).


Performance of classification and regression tasks

Both classification tasks, namely (i) classification of medium and non-medium classes, and (ii) subsequent classification of high and low classes, were benchmarked with three well-known and widely used algorithms: XGB, RF, and SVM. XGB was found to outperform both SVM and RF in both classification tasks (Fig. 4A). In the internal validation tests, the average accuracies of XGB–Classifiers 1 and 2 are 76.45% and 77.27%, respectively, while their respective average accuracies in the independent tests are 82.14% and 85.71%. In both classification tasks, the performance measures such as precision, recall, F-measure, and AUC were also found to be above 0.75, while the MCC was around 0.5 (Table 2). Similarly, the regression tasks were benchmarked using the same algorithms (i.e., XGB, RF, and SVM) coupled with LOOCV for the three classes: high, low, and medium (Fig. 4B). In the internal cross-validation, the highest PCCs were obtained by SVR-Low (0.83), RFR-Medium (0.90), and RFR-High (0.87). The PCCs of these regression models in the independent testing were also higher than those of their respective counterparts. As shown in Table 3, the low MAEs and RMSEs of the three chosen regression models are in good agreement with the range of expression yields of each class. The values of PCC, MAE, and RMSE of SVR-Low, RFR-Medium, and RFR-High suggest that the developed regression models can predict the protein expression yields accurately and reliably. The MAE and RMSE values for SVR-Low were very low (0.06 and 0.09, respectively), while the values of these measures increased drastically for RFR-Medium (59–65 times those of SVR-Low) and RFR-High (12–13 times those of RFR-Medium) (Table 3). Such an increase in MAE and RMSE values is expected because the ranges of expression yields in these classes differ by orders of magnitude.

Fig. 4

Benchmarking of the performance of different algorithms. a) Classification tasks for both training and testing datasets b) Regression tasks for both training and testing datasets.

Table 2

Classification Task – Benchmarking with three algorithms.

Algorithm	Classifier 1			Classifier 2
Algorithm	RF	XGB	SVM	RF*	XGB	SVM
Selected number of features	4	8	16	1	4	6
Accuracy (%)	81.36	76.45	63.35	–	77.27	68.18
Error rate (%)	18.63	23.55	36.65	–	22.73	31.82
Precision	0.814	0.764	0.636	–	0.776	0.682
Recall	0.814	0.764	0.634	–	0.773	0.682
F-measure	0.813	0.764	0.629	–	0.773	0.682
MCC	0.626	0.527	0.267	–	0.549	0.362
AUC	0.913	0.788	0.643	–	0.791	0.747

Performance of the model has been evaluated using leave-one-out cross validation (LOOCV).

* Since there is only one key feature selected, further model training is neither essential nor meaningful in this case.

Table 3

Regression task – Benchmarking with three algorithms.

Algorithm	Regression – Low			Regression – Medium			Regression – High
Algorithm	RF	XGB	SVM	RF	XGB	SVM	RF	XGB	SVM
Selected number of features	5	5	12	12	12	14	9	6	10
Pearson Correlation Coefficient (PCC)	0.7891	0.7103	0.8288	0.8971	0.7574	0.8623	0.8664	0.8534	0.8137
Mean Absolute Error (MAE)	0.0738	0.2759	0.0623	3.6673	8.8066	4.6152	47.4097	114.973	47.0493
Root Mean Squared Error (RMSE)	0.0944	0.2993	0.0887	5.7796	13.5347	6.6317	76.694	163.13	90.4669

Performance of the model has been evaluated using leave-one-out cross validation (LOOCV).


Predictive performance of the developed model

Independent test validation gave an overall classification accuracy of 75% for the 28 unseen instances: 21 instances, including the six unseen proteins, were correctly classified, and their predicted yields were close to the experimental values (Table 4). The model predicted the expression yields of the correctly classified instances with a high PCC of 0.91. Even when the misclassified instances were included, the PCC across all 28 instances remained high (0.80). The MAE and RMSE values for the correctly classified instances were low, i.e., 22.38 and 62.07, respectively.

Table 4

Predicted yields at the given experimental conditions.

No	SP-protein combination	Process conditions				Actual expression	Predicted expression
No	SP-protein combination	OD (au)	IPTG (mM)	Temperature (°C)	Time (h)	Yield (mg/L)	Level	Yield (mg/L)	Level
1	pel-B-eGFP	0.5	0.1	18	4	4.7	M	3.0	M
2	Cex-eGFP	0.7	0.5	27	4	4.8	M	0.3	*L
3	ompA-eGFP	1	0.5	18	4	2.8	M	2.0	M
4	ompC-eGFP	0.7	0.5	18	4	5.7	M	2.6	M
5	Lpp-eGFP	0.4	0.1	18	4	0.6	M	1.6	M
6	DmsA-eGFP	1	1	18	4	80.1	H	20.7	*M
7	MdoD-eGFP	0.4	0.5	27	4	14.2	M	6.2	M
8	pel-B-TMT	0.4	0.5	28	4	53.2	H	27.1	*M
9	Cex-TMT	0.4	1	38	4	137.5	H	120.8	H
10	ompA-TMT	–	–	–	4	0.0	L	0.0	L
11	ompC-TMT	–	–	–	4	0.0	L	0.0	L
12	Lpp-TMT	0.7	1	18	4	95.5	H	116.5	H
13	DmsA-TMT	0.4	0.7	28	4	482.2	H	175.3	H
14	MdoD-TMT	0.4	1	38	4	24.5	M	9.0	M
15	pelB-IFN	4 (TB)	0.05	25	14	0.4	L	0.1	L
16	pelB-VEGFR2-D3	1	1	37	20	2.0	M	4.1	M
17	pho-rhES	0.6	0.3	25	13.57	2.2	M	1.8	M
18	modspA-CALB	1	12.5%(L)	24	15	234.0	H	126.5	H
19	MBP-6 × His-U24	0.5–1.0	0.3	18	16	2.8	M	3.0	M
20	Pel-B-SynVNAR-A6	0.5	0.1	18	21	27.0	M	7.5	M
21	modBlaasp-hAct A	0.6	1	37	8	150.0	H	0.0	*L
22	CusF-GFP	0.5	0.1	12	25	8.0	M	5.2	M
23	ecotin-HArbd	0.6–0.8	0.4–1	30	8–10	10.0	M	13.3	M
24	mBiP-scFv	0.5	0.2	30	5	115.0	H	7.3	*M
25	pelB-scFv-dmOKT3	0.8	0.1	22–24	18–20	0.2	L	13.3	*M
26	stII-vtPA	0.5	1	30	6	0.2	L	0.1	L
27	LTIIb-B-CT-B	0.3	0.02	37	6	190.0	H	8.3	*M
28	pelB-rPA	0.7	1	24	21	0.0	L	0.2	L

The misclassified instances are represented by an asterisk (*). TB – Terrific broth (medium); L – Lactose (inducer).
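The overall classification accuracy can be recomputed from the "Level" columns of Table 4. The sketch below copies the actual and predicted levels row by row (rows 1 to 28) and counts the matches; the variable names are illustrative only.

```python
from collections import Counter

# Actual and predicted expression levels for rows 1-28 of Table 4.
actual = "M M M M M H M H H L L H H M L M M H M M H M M H L L H L".split()
predicted = "M L M M M M M M H L L H H M L M M H M M L M M M M L M L".split()

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Tally of (actual, predicted) pairs for the misclassified rows only.
confusion = Counter((a, p) for a, p in zip(actual, predicted) if a != p)

print(f"{correct} of {len(actual)} correct, accuracy = {accuracy:.0%}")
print(dict(confusion))
```

This gives 21 correct instances out of 28, i.e., 75%. Five of the seven errors are high-class instances assigned to a lower class, which is the case discussed later as the main source of large yield deviations.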

Prediction of the maximal RPP using the proposed model

For a target protein, the proposed model predicts the top ten protein yields and the corresponding fermentation process conditions. We evaluated all 28 independent test instances and present the optimal yields and fermentation conditions (Table 5). To further evaluate the prediction performance of the proposed model, we compared its top-ten predicted yields and fermentation conditions for the proteins used in our experiments with the yields predicted at the same conditions by the respective statistical regression equations (Tables S6-1 to S6-14). The two sets of predictions were in close agreement, indicating that our model mimics the individual RSM-based regression models and has the additional advantage of extending the predictions to any recombinant protein.

Table 5

Predicted maximal yields and the corresponding fermentation conditions.

No	SP-protein combination	Experimental						Predicted
		Process conditions				Optimal expression		Process conditions				Optimal expression
		OD (au)	IPTG (mM)	Temperature (°C)	Time (h)	Yield (mg/L)	Level	OD (au)	IPTG (mM)	Temperature (°C)	Time (h)	Yield (mg/L)	Level
1	pel-B-eGFP	0.5	0.1	18	4	4.7	M	1	0.1	20	4	3.8	M
2	Cex-eGFP	0.7	0.5	27	4	4.8	M	0.7	0.5	25	16	4.6	M
3	ompA-eGFP	1	0.5	18	4	2.8	M	0.7	1	25	24	1.9	M
4	ompC-eGFP	0.7	0.5	18	4	5.7	M	0.7	0.5	25	4	4.3	M
5	Lpp-eGFP	0.4	0.1	18	4	0.6	M	0.7	0.5	30	24	1.9	M
6	DmsA-eGFP	1	1	18	4	80.1	H	1	0.5	30	4	38.4	*M
7	MdoD-eGFP	0.4	0.5	27	4	14.2	M	0.4	0.5	30	8	8.1	M
8	pel-B-TMT	0.4	0.5	28	4	53.2	H	1	0.5	30	4	31.9	*M
9	Cex-TMT	0.4	1	38	4	137.5	H	0.4	1	37	24	141.5	H
10	ompA-TMT	–	–	–	4	0.0	L	1	1	30	24	0.1	L
11	ompC-TMT	–	–	–	4	0.0	L	0.4	1	20	4	26.0	*M
12	Lpp-TMT	0.7	1	18	4	95.5	H	0.4	1	37	24	151.6	H
13	DmsA-TMT	0.4	0.7	28	4	482.2	H	0.4	0.5	30	8	268.3	H
14	MdoD-TMT	0.4	1	38	4	24.5	M	0.4	1	37	8	16.3	M

The misclassified instances are represented by an asterisk symbol (*).

Discussion

In the present work, we have developed a robust ML-based tool that predicts the top-ten maximal protein expression yields and the corresponding fermentation process conditions for the expression of recombinant proteins in the periplasm of E. coli. Importantly, we combined key features from both the amino acid sequence and the fermentation process to gain a better understanding of the important determinants of recombinant protein expression and to construct a model that yields good predictions. Our results demonstrate that the predictions of the developed model agree well with the experimental findings. The primary reason for the successful prediction is the strength and diversity of the datasets used in the development of the model. Another notable factor is the appropriate selection of the feature subsets that represent each of the models. The screening of important and meaningful features was made possible by the vast number of features extracted from the literature and by the stepwise feature selection strategy. The diverse candidate features (e.g., amino acid composition, physico-chemical and structural properties), together with the features related to the fermentation process, were processed by the stepwise feature selection strategy to yield a more meaningful prediction result. Our feature selection strategy revealed that features based on the amino acid sequence play a vital role in the classification of the “low” and “high” classes, hinting that the expression level of a protein depends primarily on its amino acid sequence. Therefore, the amino acid sequence largely defines whether the expression of a recombinant protein is “low”, “medium” or “high”, and further fine-tuning of the process parameters can substantially improve the expression yield within that class.
Among the selected features, the occurrence of valine (Occ_V), the occurrence frequencies of aromatic residues (OF_Aromatic) and hydrophilic residues (OF_Hphil_ESG), the difference between the occurrence frequencies of aspartic acid and glutamic acid residues (OF_DmE), the ratio of helix to sheet propensity (Helix_to_Sheet_PHD), temperature (Temp) and the interaction between temperature and time (Temp×Time) appeared to be highly relevant and significant. Past studies corroborate the importance of the selected features. For instance, the probability of expressing a protein in soluble form was inversely correlated with the size of the protein [28], hinting that Seq_len is an essential feature. Besides, amino acid composition was found to be a critical factor inducing metabolic stress during RPP in E. coli [29]; hence, the expression of a recombinant protein can be improved by adjusting its amino acid composition [30]. The present study revealed specifically that the occurrences and occurrence frequencies of amino acids, such as Occ_E, Occ_V, OF_E, OF_S, OF_F, OF_M, MNC_A, OF_MNC_A, and OF_MNC_C, are significant factors for soluble protein expression in the periplasm of E. coli. Similarly, the occurrences of the hydrophilic residues, i.e., proline (P), tyrosine (Y), histidine (H), glutamine (Q) and asparagine (N), appeared to be key determinants in SVR-Low and RFR-Medium in this study. Protein solubility has been shown to be affected by the presence of hydrophilic amino acids in the protein, which may in turn influence the expression levels of recombinant proteins in E. coli [30], [31]. Trevino et al. (2007) showed that the solubility of ribonuclease from Streptomyces aureofaciens (RNase Sa) was enhanced more by amino acid residues such as aspartic acid (D), glutamic acid (E) and serine (S) in the protein sequence than by other hydrophilic residues such as asparagine (N), glutamine (Q), and threonine (T) [32].
This supports the selection of OF_DmE, the occurrence frequency of aspartic acid residues minus that of glutamic acid residues, as a significant feature in RFR-High. Although pI was identified as a key feature in the development of XGB-Classifier 2 (Table 1), a previous study reported that pI does not affect the solubility or expression of mammalian proteins in E. coli [33]. Combining fermentation-process-based features with amino-acid-sequence-based features is highly beneficial. For instance, a previous study revealed that the recombinant expression of insoluble proteins is highly correlated with temperature and fermentation time [34]. Our results showed that the Temp×Time interaction is a significant process feature. Also, it is well known that a lower cell density can result in a lower expression yield [35], while different concentrations of inducer have varying levels of influence on the yields of recombinant protein [36]. Our findings support these observations and demonstrate the importance of the OD and IPTG features in the development of the regression models (SVR-Low and RFR-High). Furthermore, the existing literature on the statistical optimization of RPP indicates that the interactions between process features are significant in soluble protein expression [7], [37], [38], [39], [40], [41], [42]. Our models (XGB-Classifier 1, SVR-Low, RFR-Medium, and RFR-High) confirmed that the interactions among the process features (cell density, inducer concentration, post-induction time, and temperature) contributed significantly to the prediction of the expression yields. Further investigation of how these interactions govern recombinant protein expression may help to clarify the key factors for achieving high yields of recombinant protein.
At the given fermentation conditions, the predicted expression yields for the correctly classified instances were close to the actual values in the independent test datasets (Table 4). For example, the predicted expression yields of the recombinant proteins pel-B-eGFP, ompA-eGFP, MBP-6 × His-U24, and ecotin-HArbd (classified as medium expression) as well as ompA-TMT, ompC-TMT and stII-vtPA (classified as low expression) closely resembled the actual expression levels of periplasmic proteins in E. coli. Only a few instances, such as Cex-TMT and Lpp-TMT (classified as high expression), showed slight variation from the actual expression yields reported in the literature, while the predicted expression yields of DmsA-TMT and modspA-CALB showed moderate deviations from their actual expression yields. However, the predicted expression yields of the misclassified instances deviated considerably, particularly for high-class instances misclassified into either the low or the medium class. For example, large deviations in the predicted expression yields were noted for modBlaasp-hAct A, mBiP-scFv, and LTIIb-B-CT-B, which were misclassified into the low or medium class. This undesirable outcome arises because the yield ranges of the three classes differ by orders of magnitude; hence, a classification error by XGB-Classifier 1 or 2 propagates to the performance of the overall prediction model. This issue was addressed by adopting three different regression models to cover the wide range of protein expression levels categorized by the classification models. If the expression level of a target protein is classified correctly in the first stage of prediction, the expression yield of the target protein is highly likely to be predicted accurately by the respective regression model.
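The way a first-stage error carries over into the yield estimate can be seen in a minimal sketch of the two-stage dispatch. It assumes the class thresholds given in the abstract (high: >50 mg/L; low: <0.5 mg/L; medium otherwise) and the classifier order described in the Results (medium vs. non-medium, then high vs. low). The function names and the constant-valued stand-in models are hypothetical; the real framework uses trained XGBoost, SVR and RF models, and treating values exactly at 0.5 or 50 mg/L as medium is an assumption.

```python
from typing import Callable, Dict, Tuple

Features = Dict[str, float]

def yield_to_level(yield_mg_per_l: float) -> str:
    """Map a yield (mg/L) to its class; boundary values are treated as medium (assumption)."""
    if yield_mg_per_l > 50:
        return "high"
    if yield_mg_per_l < 0.5:
        return "low"
    return "medium"

def predict_two_stage(
    features: Features,
    is_medium: Callable[[Features], bool],   # stand-in for XGB-Classifier 1
    is_high: Callable[[Features], bool],     # stand-in for XGB-Classifier 2
    regressors: Dict[str, Callable[[Features], float]],
) -> Tuple[str, float]:
    """Stage 1 picks the class; stage 2 applies that class's regressor."""
    if is_medium(features):
        level = "medium"
    elif is_high(features):
        level = "high"
    else:
        level = "low"
    return level, regressors[level](features)

# Hypothetical stand-ins: each regressor only returns values inside its own class range.
regressors = {
    "low": lambda f: 0.1,
    "medium": lambda f: 20.0,
    "high": lambda f: 120.0,
}
sample = {"Temp": 30.0, "OD": 0.5}

# A truly high-yield protein, routed correctly and then misrouted by stage 1.
correct = predict_two_stage(sample, lambda f: False, lambda f: True, regressors)
misrouted = predict_two_stage(sample, lambda f: True, lambda f: True, regressors)
print(correct, misrouted)
```

Since each regressor is trained on yields from a single class, a protein routed to the wrong regressor receives an estimate from the wrong yield range, which matches the large deviations seen for the misclassified high-class instances in Table 4.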
The maximal protein expression yield was predicted by computing the protein expression yields under various combinations of fermentation process conditions and selecting the top-ten maximal yields. A significant improvement in the prediction performance was noted when different combinations of fermentation process conditions were included in the testing sets (Table 5). One of the previously misclassified instances, namely Cex-eGFP, was correctly classified when different combinations of fermentation process conditions were considered during the testing; accordingly, the predicted expression yield (4.6 mg/L) was close to the actual level (4.8 mg/L). Similarly, the predicted yield of Cex-TMT (141.5 mg/L) approximately matched the actual expression level (137.5 mg/L). Most of the predicted levels of protein expression were close to their actual levels, except for a few instances (e.g., DmsA-eGFP, ompC-TMT, DmsA-TMT and Lpp-TMT) that deviated from the actual expression levels (Table 5). Overall, the predicted top-ten expression yields and the corresponding fermentation conditions were similar to those achieved during experiments (Table 5 and Tables S6-1 to S6-6). For instance, the fermentation conditions corresponding to the maximal predicted yields of pel-B-eGFP, ompC-eGFP, and DmsA-TMT exactly match the experimental conditions that lead to the optimal yields of protein, while for the other instances, these fermentation conditions are similar to the optimal conditions predicted by our model. Based on the supplementary tables (Tables S6-1 to S6-6), it is evident that process-level features play an important role in the expression of the “high” class, while interactions within process-level features are significant in the “medium” expression class, which is also substantiated by the feature selection strategy (Table 1).
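The search described at the start of this paragraph, scoring many combinations of process conditions and keeping the ten best, can be sketched as an exhaustive grid search. Everything specific here is an assumption made for illustration: the grid levels are simply values that appear in the predicted columns of Table 5, and `toy_yield_model` is a hypothetical response surface standing in for the trained two-stage predictor.

```python
import heapq
from itertools import product

# Candidate levels per process condition (assumed; taken from values seen in Table 5).
GRID = {
    "od": [0.4, 0.7, 1.0],
    "iptg_mM": [0.1, 0.5, 1.0],
    "temp_C": [20, 25, 30, 37],
    "time_h": [4, 8, 16, 24],
}

def toy_yield_model(od, iptg_mM, temp_C, time_h):
    """Hypothetical yield surface peaking at OD 0.7, 0.5 mM IPTG, 30 deg C, 8 h."""
    return (10.0
            - 8.0 * (od - 0.7) ** 2
            - 4.0 * (iptg_mM - 0.5) ** 2
            - 0.02 * (temp_C - 30) ** 2
            - 0.01 * (time_h - 8) ** 2)

def top_conditions(model, grid, k=10):
    """Score every grid combination and return the k highest (yield, conditions) pairs."""
    names = list(grid)
    candidates = (dict(zip(names, values)) for values in product(*grid.values()))
    scored = ((model(**conditions), conditions) for conditions in candidates)
    # The key avoids comparing dicts when two combinations tie on yield.
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

best = top_conditions(toy_yield_model, GRID)
for predicted_yield, conditions in best:
    print(f"{predicted_yield:6.2f}  {conditions}")
```

Returning ten candidates rather than a single optimum gives the experimenter a range of nearby conditions to try, which is how the top-ten output is used in the discussion that follows.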
Therefore, our prediction tool enables an easy optimization of RPP by suggesting (i) whether a particular target protein can be expressed in significant amounts and (ii) the ranges of fermentation parameters based on the predicted top-ten expression levels of the target protein. These predictions could provide a good basis for experimental design by guiding the choice of (i) an appropriate target protein, (ii) a signal peptide and (iii) a set of fermentation conditions to start with, in an attempt to achieve the desired yields of recombinant proteins. Such insights will also be valuable in subsequent optimization studies conducted to improve the yield and design of industrial-scale RPP in E. coli.

Table 1

Selected features for prediction model.

Feature category	Classification models		Regression models
Feature category	XGB–Classifier 1	XGB–Classifier 2	SVR–Low	RFR–Medium	RFR–High
FC1	Occ_E	Seq_len	OF_E	OF_F	OF_DmE
	MNC_A	Occ_V	OF_S	OF_M
	OF_MNC_C	pI_Protpi	OF_Aliphatic	OF_MNC_A
	OF_Aromatic	Helix_to_Sheet_PHD	OF_Hphil_ESG	OF_Aromatic
	Expno_AA_TM		OF_Hphil_KD	OF_Hphil_ESG
			Coil_PHD	Pred_Sol
FC2	–	–	–	–	–
FC3	OF_Hphil_ESG×Sheet_PHD	–	–	–	Occ_N×MNC_Y
FC3					Occ_Y×OF_DmE
FC4	Temperature	–	OD	Temperature	IPTG
	OD×Time		Temperature	OD×IPTG	Temperature
			OD×Temp	OD×Temp	OD×Temp
			OD×Time	OD×Time	OD×Time
			IPTG×Temp	IPTG×Temp	IPTG×Temp
			Temp×Time	Temp×Time	Temp×Time
Selected features	8	4	12	12	9
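The FC4 entries in Table 1 combine single process conditions with pairwise terms such as OD×Temp and Temp×Time. A minimal sketch of how such terms could be generated is given below. It assumes that "×" denotes the plain product of the two raw values, as is usual for interaction terms in response surface models; whether the study scaled the variables first is not stated here, and the function name is hypothetical.

```python
from itertools import combinations

def add_pairwise_interactions(conditions: dict) -> dict:
    """Return the original conditions plus the product of every pair of them."""
    features = dict(conditions)
    for (name_a, value_a), (name_b, value_b) in combinations(conditions.items(), 2):
        features[f"{name_a}×{name_b}"] = value_a * value_b
    return features

# Process conditions of pel-B-eGFP (Table 4, row 1).
row = {"OD": 0.5, "IPTG": 0.1, "Temp": 18, "Time": 4}
feats = add_pairwise_interactions(row)

for name, value in feats.items():
    print(f"{name:10s} {value}")
```

Four conditions give six pairwise terms. The sketch generates all six, including IPTG×Time, which does not appear among the selected features in Table 1; in the study, the feature selection step decides which of them each model keeps.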

The major limitation in developing a model for predicting protein expression yields is the scarcity of experimental results from relevant studies. Although many studies have reported the production of periplasmic recombinant proteins by E. coli, not all of these data could be considered in model development because of: i) missing information on the fermentation process conditions; ii) missing or irretrievable amino acid sequences; iii) an irrelevant scale of the fermentation process (micro-level or bioreactor level); iv) protein concentrations that could not be quantified in mg/L; and v) vectors using promoters other than lac promoters, or hosts or vectors modified as different genetic variants. Secondly, the collected data are prone to some variability arising from the different fermentation and protein quantification protocols used by researchers; for example, the scale of shake flask fermentation may add to this variation. Finally, the variability due to specific host strains of E. coli and the corresponding vectors can strongly affect recombinant protein expression yields. However, the majority of these general limitations were addressed as far as possible during the development of PERISCOPE-Opt. For example, although the scarcity of available data limits the prediction accuracy of the optimization model, the dataset generated for the development of the proposed model is robust because it covers a wide range of proteins (84 different types) and SP-protein combinations (103 different types). Further, the data incorporated proteins of a wide range of sizes (the smallest protein contained 80 amino acids, while the largest was 668 amino acids long) along with a good number of instances in each class: high, medium or low.
Next, the variability due to collecting data from various sources was kept to a minimum by considering only data generated using shake flask fermentation, so that the process conditions, including agitation and mixing, would be similar. Micro-scale and bioreactor-based fermentations were excluded from the present study because these methods may introduce additional variability in process conditions compared with shake flask fermentation. Different schemes of downstream processing, purification or quantification can affect the final protein yields, but the recombinant protein yield data collected at the particular fermentation conditions are still comparable and relevant to our model development. Lastly, only lac-operator-based vectors were considered in the data collection to avoid biases in protein expression due to the use of other types of vector. Genetically improved E. coli host strains, such as those with additional molecular chaperones, were avoided because these strains may introduce additional variability, while the conventionally used E. coli host strains were assumed to express periplasmic protein similarly. The variability caused by the different conventional host strains used for fermentation was not resolved, as it is outside the scope of the present study. Nevertheless, treating E. coli host strains and vectors as variables would be a good option, as adding features that describe the properties of the E. coli host and vectors would likely improve the prediction accuracy of the proposed model. Similarly, the prediction accuracy of the developed model could be improved by incorporating other relevant features based on gene-level factors (codon bias and mRNA secondary structures).
Apart from that, incorporating features corresponding to experimentally derived structural properties of proteins, instead of sequence-derived and predicted structural properties, would tend to improve the performance of the prediction model. The inclusion of these novel features in future work will serve as an avenue to fine-tune the developed model for improved prediction.

Conclusions

The ML tools available for protein-based applications generally consider features derived from the amino acid sequence and are incapable of predicting the optimal conditions of RPP. Therefore, an ML-based model was developed by combining features from the amino acid sequence and the fermentation process to predict the optimal yield as well as the corresponding fermentation conditions for the expression of a given recombinant protein in the periplasm of E. coli. Our proposed two-stage framework, PERISCOPE-Opt, suggested optimal recombinant protein yields that closely match the reported experimental results. The recommended optimal yields and the corresponding fermentation conditions give an overall idea of the fermentation process for the expression of a target protein. PERISCOPE-Opt could serve as a powerful and reliable web tool that identifies the optimal fermentation conditions and RPP yield without relying on excessive rounds of trial-and-error experiments.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
