Zundan Ding1, Feifei Guan1, Guoshun Xu1,2, Yuchen Wang3,1, Yaru Yan1, Wei Zhang1, Ningfeng Wu1, Bin Yao2, Huoqing Huang2, Tamir Tuller4, Jian Tian1.
Abstract
The expression of proteins in Escherichia coli is often essential for their characterization, modification, and subsequent application. Gene sequence is the major factor contributing to expression. In this study, we used expression data from 6438 heterologous proteins, all expressed under the same conditions in E. coli, to construct a deep learning classifier for screening high- and low-expression proteins. In conjunction with conserved residue analysis to minimize functional disruption, a mutation predictor for enhanced protein expression (MPEPE) was proposed to identify mutations conducive to protein expression. MPEPE identified mutation sites in laccase 13B22 and the glucose dehydrogenase FAD-AtGDH that significantly increased both the expression levels and the activity of these proteins. Additionally, a significant correlation of 0.46 was detected between the high-level expression propensity predicted by the constructed models and the protein abundance of endogenous genes in E. coli. The study therefore provides foundational insights into the relationship between specific amino acid usage, codon usage, and protein expression, and is essential for research and industrial applications.
Keywords: Deep learning; MPEPE; Mutation; Protein expression
Year: 2022 PMID: 35317239 PMCID: PMC8913310 DOI: 10.1016/j.csbj.2022.02.030
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Dataset classification and sizes.
| Dataset | Evaluation Score | Class | Number of Sequences | Constructed Dataset |
|---|---|---|---|---|
| lixiProtein Expression Dataset | 1 | Negative data | 1754 | Low-expression dataset |
| | 2 | Negative data | 131 | Low-expression dataset |
| | 3 | Negative data | 423 | Low-expression dataset |
| | 4 | – | 896 | Validation dataset |
| | 5 | – | 1171 | Validation dataset |
| | 6 | Positive data | 1973 | High-expression dataset |
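The grouping in the table above can be written down as a small lookup. The following is a minimal sketch, not the authors' code: the function names `assign_dataset` and `dataset_sizes` and the short labels `"low"`, `"validation"`, and `"high"` are hypothetical, while the score-to-class mapping and the per-score counts are copied from the table.

```python
from collections import Counter
from typing import Dict

# Score groups as listed in the table (names below are illustrative only).
LOW_SCORES = {1, 2, 3}      # "Negative data" -> low-expression dataset
HELD_OUT_SCORES = {4, 5}    # no class label  -> validation dataset
HIGH_SCORES = {6}           # "Positive data" -> high-expression dataset

# Number of sequences per evaluation score, copied from the table.
COUNTS_PER_SCORE = {1: 1754, 2: 131, 3: 423, 4: 896, 5: 1171, 6: 1973}


def assign_dataset(score: int) -> str:
    """Map an evaluation score (1 to 6) to the dataset it belongs to."""
    if score in LOW_SCORES:
        return "low"
    if score in HELD_OUT_SCORES:
        return "validation"
    if score in HIGH_SCORES:
        return "high"
    raise ValueError(f"unexpected evaluation score: {score}")


def dataset_sizes(counts: Dict[int, int]) -> Dict[str, int]:
    """Sum the per-score sequence counts into per-dataset sizes."""
    sizes: Counter = Counter()
    for score, n in counts.items():
        sizes[assign_dataset(score)] += n
    return dict(sizes)


if __name__ == "__main__":
    print(dataset_sizes(COUNTS_PER_SCORE))
```

Summing the table rows this way gives the size of each constructed dataset without retyping the totals by hand.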
Fig. 1 The workflow of MPEPE based on deep learning and evolutionary analysis. A. The protein datasets were used as inputs for constructing and training the prediction model. B. Mutation sites were screened using evolutionary analysis of the target protein sequence so as not to disrupt function. C. The target nucleotide sequences were used as inputs to MPEPE to virtually screen mutants. D. Experimental validation of the effect of the virtually screened mutants on expression level in E. coli. E. A new dataset was constructed from the experimental results to optimize the MPEPE model.
Fig. 2 Comparison of amino acid and codon frequencies in lowly and highly expressed proteins. A. Amino acid usage differences between the lowly and highly expressed proteins. B-C. Codon frequencies in the highly and lowly expressed proteins and in the endogenous E. coli genome. D-E. Codon frequencies in the highly and lowly expressed proteins and in the exogenous E. coli genome. “ns” denotes not significant, “*” denotes 0.01 < p-value ≤ 0.05, “**” denotes 0.001 < p-value ≤ 0.01, and “***” denotes p-value ≤ 0.001.
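As an illustration of the quantities compared in Fig. 2, the sketch below counts relative codon frequencies in a single coding sequence and converts a p-value into the significance label defined in the caption. It is a minimal sketch under stated assumptions: the function names `codon_frequencies` and `significance_label` are hypothetical, the input is assumed to be an in-frame coding sequence, and the statistical test that produces the p-value is not shown because the record does not name it.

```python
from collections import Counter
from typing import Dict


def codon_frequencies(cds: str) -> Dict[str, float]:
    """Return the relative frequency of each codon in an in-frame coding sequence."""
    seq = cds.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}


def significance_label(p_value: float) -> str:
    """Map a p-value to the labels used in the Fig. 2 caption."""
    if p_value <= 0.001:
        return "***"
    if p_value <= 0.01:
        return "**"
    if p_value <= 0.05:
        return "*"
    return "ns"


if __name__ == "__main__":
    print(codon_frequencies("ATGGCTGCTTAA"))
    print(significance_label(0.03))
```

Comparing groups of proteins would require pooling such frequencies over every sequence in the high- and low-expression datasets; that aggregation step is left out here.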
Fig. 3 Evaluation of the predictive performance of the three constructed models. A-B. Receiver operating characteristic and precision-recall curves for the three models, based on the results of 10-fold cross-validation. C. Model evaluation metrics. D. Prediction results of the three models on the independent test sets class 4 and class 5. “ns” denotes not significant, “*” denotes 0.01 < p-value ≤ 0.05, and “**” denotes p-value ≤ 0.01.
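The area under a receiver operating characteristic curve, as plotted in Fig. 3A, can be computed directly from labels and predicted scores. The sketch below uses the pairwise (Mann-Whitney) formulation, which equals the area under the ROC curve; it is an illustration only, the function name `roc_auc` is hypothetical, and nothing here reproduces the models or the cross-validation setup of the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for high expression, 0 for low expression.
    scores: predicted propensity for the positive class.
    Ties between a positive and a negative score count as half a win.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The double loop is quadratic in the number of examples, which is acceptable for a sketch; a rank-based implementation would be preferable for datasets of several thousand sequences.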
Spearman rank correlation of the predicted high-level propensity with protein abundance (PA) [a].
| Group | Number of Genes | Pre1 | Pre2 | Pre3 |
|---|---|---|---|---|
| Bacteria | 3063 | 0.1342 | 0.4036 | 0.4581 |
| Bacteria | 2200 | 0.0447 | 0.2240 | 0.2494 |
| Bacteria | 2943 | 0.2130 | 0.3270 | 0.3718 |
| Bacteria | 1166 | 0.0443 | 0.2939 | 0.3816 |
| Bacteria | 1064 | 0.0531 | 0.2685 | 0.3290 |
| Archaea | 1092 | −0.1538 | 0.2637 | 0.1428 |
| Fungi | 4646 | −0.0582 | 0.3386 | 0.3617 |
[a] Protein abundance data of genes from PaxDb.
Pre1: Predicted high-level propensity with the coding scheme of the synonymous codon number.
Pre2: Predicted high-level propensity with the coding scheme of the specific amino acid.
Pre3: Predicted high-level propensity with the coding scheme of the specific nucleotide combination.
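The correlations reported above are Spearman rank correlations between a predicted propensity and a measured protein abundance. The following is a minimal, dependency-free sketch of that statistic, computed as the Pearson correlation of average ranks so that ties are handled; the function names are hypothetical and the example values are made up for illustration, not taken from PaxDb or from the study.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    predicted = [0.10, 0.35, 0.80, 0.55]      # made-up propensities
    abundance = [12.0, 40.0, 900.0, 150.0]    # made-up abundances
    print(spearman(predicted, abundance))
```

Because only ranks enter the calculation, the statistic is insensitive to the scale of the abundance measurements, which is one reason it suits data spanning several orders of magnitude.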
Fig. 4 The entropy of the residue and the distribution of the mutations on the sequence and structure of laccase 13B22 and FAD-AtGDH. A–C. Residue entropy of the laccase 13B22 (A) and FAD-AtGDH (B). The black dot represents the location of the screened mutation. The strand, helix, and coil are the predicted secondary structure based on the method. B–D. The distribution of the mutations on the structure of laccase 13B22 and FAD-AtGDH.
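Residue entropy of the kind plotted in Fig. 4 is commonly computed per column of a multiple sequence alignment. The sketch below assumes Shannon entropy in bits with gap characters ignored; the record does not state the exact formula or alignment procedure used, so both choices, along with the function names `column_entropy` and `alignment_entropy`, are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def column_entropy(column: str, ignore_gaps: bool = True) -> float:
    """Shannon entropy (bits) of the residues in one alignment column."""
    residues = [r for r in column.upper() if not (ignore_gaps and r == "-")]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return entropy + 0.0  # normalizes -0.0 to 0.0 for fully conserved columns


def alignment_entropy(aligned_seqs: Sequence[str]) -> List[float]:
    """Per-position entropy for a list of equal-length aligned sequences."""
    lengths = {len(s) for s in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    return [column_entropy("".join(col)) for col in zip(*aligned_seqs)]


if __name__ == "__main__":
    toy_alignment = ["MKTA", "MKSA", "MRTA", "MK-A"]  # made-up sequences
    print(alignment_entropy(toy_alignment))
```

Under this definition a fully conserved column scores 0 and more variable columns score higher, so positions with high entropy are the less conserved ones, where a substitution is presumably less likely to disrupt function.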
Fig. 5 Enzymatic activity and the amino acid and codon distribution of the mutants and wild types of laccase 13B22 and FAD-AtGDH. A. Measured enzymatic activity of the single-point mutants and wild type of laccase 13B22. B. Measured enzymatic activity of the single-point mutants and wild type of FAD-AtGDH. C-D. The amino acid and codon selection of the mutants and wild types of laccase 13B22 (C) and FAD-AtGDH (D). The color bar represents the amino acid usage difference between the high- and low-level expressed genes.
Fig. 6 Soluble expression and enzymatic activity assay of 13B22 and FAD-AtGDH. L1M, L3M, L5M, and L7M denote the 1-, 3-, 5-, and 7-point mutants of laccase 13B22, respectively; A1M, A3M, A5M, and A7M denote the 1-, 3-, 5-, and 7-point mutants of FAD-AtGDH, respectively. A. Western blot of laccase 13B22 and its mutants expressed in E. coli (supernatant). B. SDS-PAGE of laccase 13B22 and its mutants expressed in E. coli (supernatant). C. Enzymatic activity of laccase 13B22 and its mutants. D. SDS-PAGE of FAD-AtGDH and its mutants expressed in E. coli (supernatant). E. SDS-PAGE of FAD-AtGDH and its mutants expressed in E. coli (precipitate). F. Enzymatic activity of FAD-AtGDH and its mutants.