Hongguang Fu, Yanbing Liang, Xiuqin Zhong, ZhiLing Pan, Lei Huang, HaiLin Zhang, Yang Xu, Wei Zhou, Zhong Liu.
Abstract
Heterologous expression is the main approach for recombinant protein production in genetic synthesis, for which codon optimization is necessary. The existing optimization methods are based on biological indexes. In this paper, we propose a novel codon optimization method based on deep learning. First, we introduce the concept of codon boxes, via which DNA sequences can be recoded into codon box sequences while ignoring the order of bases. Then, the problem of codon optimization can be converted to sequence annotation of corresponding amino acids with codon boxes. The codon optimization models for Escherichia coli were trained by the Bidirectional Long Short-Term Memory Conditional Random Field (BiLSTM-CRF). Theoretically, deep learning is a good method to obtain the distribution characteristics of DNA. In addition to the comparison of the codon adaptation index (CAI), protein expression experiments for a Plasmodium falciparum candidate vaccine and polymerase acidic protein were implemented for comparison with the original sequences and the optimized sequences from Genewiz and ThermoFisher. The results show that our method for enhancing protein expression is efficient and competitive.
Year: 2020 PMID: 33077783 PMCID: PMC7572362 DOI: 10.1038/s41598-020-74091-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Table 1. Classification of codon boxes.
| Type of codon box | Codon box | Amino acid | Codon |
|---|---|---|---|
| Type-1 | {aaa} | Lys | AAA |
| | {ccc} | Pro | CCC |
| | {ggg} | Gly | GGG |
| | {ttt} | Phe | TTT |
| Type-2 | {aac} | Gln, Asn, Thr | CAA, AAC, ACA |
| | {aag} | Arg, Glu, Lys | AGA, GAA, AAG |
| | {aat} | Ile, Asn | ATA, AAT |
| | {acc} | His, Pro, Thr | CAC, CCA, ACC |
| | {agg} | Arg, Glu, Gly | AGG, GAG, GGA |
| | {att} | Ile, Leu, Tyr | ATT, TTA, TAT |
| | {ccg} | Ala, Arg, Pro | GCC, CGC, CCG |
| | {cct} | Leu, Pro, Ser | CTC, CCT, TCC |
| | {cgg} | Ala, Arg, Gly | GCG, CGG, GGC |
| | {ctt} | Leu, Phe, Ser | CTT, TTC, TCT |
| | {ggt} | Gly, Trp, Val | GGT, TGG, GTG |
| | {gtt} | Cys, Leu, Val | TGT, TTG, GTT |
| Type-3 | {acg} | Ala, Arg, Asp, Gln, Ser, Thr | GCA, CGA, GAC, CAG, AGC, ACG |
| | {act} | His, Ile, Leu, Ser, Thr, Tyr | CAT, ATC, CTA, TCA, ACT, TAC |
| | {agt} | Asp, Met, Ser, Val | GAT, ATG, AGT, GTA |
| | {cgt} | Ala, Arg, Cys, Leu, Ser, Val | GCT, CGT, TGC, CTG, TCG, GTC |
According to the codon box concept, 64 codons can be divided into 20 kinds of codon boxes. Furthermore, the codon boxes can be classified into three categories: Type-1 has only one kind of base; Type-2 has two kinds of bases; and Type-3 has three kinds of bases.
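The grouping described above can be checked with a short script. The following is a minimal sketch, assuming the standard genetic code; the helper names (`codon_box`, `build_lookup`) are illustrative and are not taken from the paper.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_box(codon: str) -> str:
    """Return the codon box: the codon's bases with their order ignored."""
    return "".join(sorted(codon.lower()))


def build_lookup() -> dict:
    """Map (amino acid, codon box) -> codon for the 61 sense codons."""
    lookup = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # stop codons are left out of the mapping
        key = (aa, codon_box(codon))
        if key in lookup:
            raise ValueError(f"{key} does not determine a unique codon")
        lookup[key] = codon
    return lookup


boxes = {codon_box(c) for c in CODON_TABLE}
# Count boxes by the number of distinct bases they contain (Type-1/2/3).
type_counts = Counter(len(set(box)) for box in boxes)
lookup = build_lookup()

print(len(boxes))            # 20
print(sorted(type_counts.items()))
print(lookup[("G", "ggt")])  # GGT
```

Because `build_lookup` raises on any collision, running it without error confirms that, under the standard code, an amino acid together with a codon box identifies exactly one sense codon.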
Figure 1. One-to-one mapping of amino acids and codon boxes with codons. An example of how an amino acid (Gly) and its corresponding codon box can uniquely determine a codon.
Figure 2. Codon optimization flowcharts based on sequence annotation models. First, the original codon sequences are translated into amino acid sequences. Then, the amino acid sequences are annotated by the trained sequence annotation models. In flowchart (a), the amino acid sequence is annotated with the 61 kinds of codons, excluding stop codons (this model is named BiLSTM-CRF(a)). In flowchart (b), the amino acid sequence is annotated with the 20 kinds of codon boxes (this model is named BiLSTM-CRF(b)). The difference in (b) is that the optimized codons are determined from the codon boxes in Table 1, using the one-to-one mapping of amino acids and codon boxes with codons described in the previous section. Generally, an annotation model with fewer kinds of labels is preferable, and the complexity of BiLSTM-CRF(b) is lower than that of BiLSTM-CRF(a).
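The recoding and decoding steps of flowchart (b) can be illustrated without a trained model. In the sketch below, the BiLSTM-CRF(b) annotator is replaced by labels read directly off an existing sequence, purely to show the data flow; the standard genetic code is assumed, and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def box_of(codon):
    """Codon box label: the bases of the codon, order ignored."""
    return "".join(sorted(codon.lower()))


# (amino acid, codon box) -> codon, for sense codons only.
DECODE = {(aa, box_of(c)): c for c, aa in CODON_TABLE.items() if aa != "*"}


def split_codons(dna):
    dna = dna.upper()
    if len(dna) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def translate(dna):
    """Step 1: codon sequence -> amino acid sequence (one-letter codes)."""
    return "".join(CODON_TABLE[c] for c in split_codons(dna))


def to_boxes(dna):
    """Recode a codon sequence as a codon box sequence."""
    return [box_of(c) for c in split_codons(dna)]


def from_boxes(protein, boxes):
    """Step 3: amino acids + predicted codon boxes -> codon sequence."""
    return "".join(DECODE[(aa, box)] for aa, box in zip(protein, boxes))


dna = "ATGGGTAAACTGTTC"   # Met-Gly-Lys-Leu-Phe
protein = translate(dna)  # "MGKLF"
labels = to_boxes(dna)    # stand-in for the model's predicted boxes
print(protein, labels)
print(from_boxes(protein, labels) == dna)  # True
```

A real annotator could emit a codon box that is incompatible with the amino acid at that position; in this sketch such a prediction surfaces as a `KeyError` in `from_boxes`.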
CAI comparison between original sequences and optimized sequences.
| DNA | bp | Original | Genewiz | ThermoFisher | BiLSTM-CRF(a) | BiLSTM-CRF(b) |
|---|---|---|---|---|---|---|
| HPDF | 615 | 0.70 | 0.85 | 0.92 | 0.96 | 0.98 |
| PAE | 1839 | 0.76 | 0.81 | 0.92 | 0.96 | 0.98 |
| MMPL3 | 2835 | 0.67 | 0.79 | 0.93 | 0.96 | 0.98 |
| FALVAC-1 | 972 | 0.67 | 0.85 | 0.93 | 0.95 | 0.96 |
| PA | 561 | 0.60 | 0.83 | 0.93 | 0.97 | 0.98 |
| PTP4A3 | 564 | 0.70 | 0.83 | 0.93 | 0.96 | 0.98 |
| Average | 1231 | 0.68 | 0.83 | 0.93 | 0.96 | 0.98 |
This table shows the CAIs of the sequences optimized by different optimization tools, among which the values for Genewiz and ThermoFisher are provided on their official websites (ThermoFisher: www.thermofisher.com, Genewiz: www.genewiz.com). BiLSTM-CRF(b) has the highest average CAI, showing that it has great potential to enhance protein expression.
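For readers unfamiliar with the index, the CAI is usually defined as the geometric mean of the relative adaptiveness of each codon in a sequence, where a codon's relative adaptiveness is its frequency in a reference gene set divided by the frequency of the most used synonymous codon. The following is a minimal sketch of that definition; the reference counts are invented for illustration and are not the E. coli reference set behind the values in the table.

```python
import math
from itertools import product

# Standard genetic code in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def relative_adaptiveness(ref_counts):
    """w(codon) = count / highest count among its synonymous codons."""
    best = {}
    for codon, n in ref_counts.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best.get(aa, 0), n)
    # Codons never seen in the reference set get no weight (a simplification).
    return {c: n / best[CODON_TABLE[c]] for c, n in ref_counts.items() if n > 0}


def cai(dna, weights):
    """Geometric mean of w over the codons that have a weight."""
    dna = dna.upper()
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference counts covering only Gly and Lys codons.
REF = {"GGT": 40, "GGC": 40, "GGA": 10, "GGG": 10, "AAA": 30, "AAG": 10}
w = relative_adaptiveness(REF)

print(cai("GGTAAA", w))  # only preferred codons -> 1.0
print(cai("GGAAAG", w))  # rarer codons -> lower score
```

Codons without a weight are simply skipped here; published CAI implementations differ in how they treat such codons and single-codon amino acids, so absolute values from this sketch are not comparable with the table.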
Comparative analysis of Jaccard similarity.
| DNA | Original | Genewiz | ThermoFisher | BiLSTM-CRF(a) |
|---|---|---|---|---|
| PTP4A3 | 0.68 | 0.74 | 0.80 | 0.85 |
| PA | 0.62 | 0.72 | 0.82 | 0.90 |
| PAE | 0.70 | 0.70 | 0.79 | 0.90 |
| FALVAC-1 | 0.62 | 0.73 | 0.80 | 0.88 |
| HPDF | 0.70 | 0.73 | 0.80 | 0.90 |
| MMPL3 | 0.65 | 0.69 | 0.76 | 0.89 |
| Average | 0.66 | 0.72 | 0.80 | 0.89 |
Jaccard similarity index between the optimized sequences of BiLSTM-CRF(b) and others.
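The Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|. The text here does not state how a sequence is turned into a set, so the sketch below assumes one plausible reading: each sequence is represented by its set of (position, codon) pairs. This is an assumption for illustration, and the function names are hypothetical.

```python
def codon_set(dna):
    """Represent a sequence as a set of (codon position, codon) pairs."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return {(i // 3, dna[i:i + 3]) for i in range(0, usable, 3)}


def jaccard(seq_a, seq_b):
    """Jaccard index |A & B| / |A | B| of the two codon sets."""
    a, b = codon_set(seq_a), codon_set(seq_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


x = "ATGGGTAAACTG"
y = "ATGGGCAAACTG"  # differs from x only in the second codon
print(jaccard(x, y))  # 3 shared pairs out of 5 distinct pairs -> 0.6
```

Under this reading, two synonymous sequences of n codons that agree at m positions score m / (2n - m), so the index rises toward 1 as the two optimizers choose the same codon more often.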
Figure 3. Comparison of protein expression levels for FALVAC-1 and PTP4A3. (a) Western blotting results for FALVAC-1. (b) Western blotting results for PTP4A3.
Comparison of grayscale value ratios corresponding to Fig. 3a.
| | Original | Genewiz | Thermo | Opt-b | Opt-a |
|---|---|---|---|---|---|
| Group 1 | 0.221 | 0.875 | 0.548 | 2.178 | 1.669 |
| Group 2 | 0.090 | 0.742 | 0.352 | 2.115 | 1.747 |
| Group 3 | 0.245 | 0.901 | 0.331 | 1.935 | 1.762 |
| Average value | 0.186 | 0.839 | 0.410 | 2.076 | 1.726 |
| Optimization ratio | 1 | 4.511 | 2.204 | 11.129 | 9.462 |
The comparison of grayscale value ratios between FALVAC-1 and GAPDH. The optimization ratio is the ratio of each method's average value to the original average value.
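The optimization ratio defined above is a simple quotient. As a small worked example, the sketch below recomputes it for the Genewiz and Thermo columns from the average values reported in the table; the function name is hypothetical.

```python
def optimization_ratio(method_avg, original_avg):
    """Average grayscale ratio of a method divided by that of the original."""
    if original_avg <= 0:
        raise ValueError("original average must be positive")
    return method_avg / original_avg


original = 0.186  # reported average value for the original sequence
print(round(optimization_ratio(0.839, original), 3))  # Genewiz -> 4.511
print(round(optimization_ratio(0.410, original), 3))  # Thermo  -> 2.204
```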
Comparison of grayscale value ratios corresponding to Fig. 3b.
| | Original | Genewiz | Thermo | Opt-b | Opt-a |
|---|---|---|---|---|---|
| Group 1 | 2.448 | 2.863 | 3.006 | 3.033 | 3.017 |
| Group 2 | 3.398 | 3.506 | 3.564 | 4.568 | 3.266 |
| Group 3 | 1.727 | 0.901 | 3.073 | 3.145 | 3.594 |
| Average value | 2.558 | 3.147 | 3.238 | 3.780 | 3.292 |
| Optimization ratio | 1 | 1.23 | 1.266 | 1.400 | 1.287 |
The comparison of grayscale value ratios between PTP4A3 and GAPDH. The optimization ratio is the ratio of each method's average value to the original average value.
Figure 4. Protein function assay for PTP4A3. In vitro phosphatase assays showed that the activities of the proteins expressed from the five sequences were almost equal (p > 0.05). Different sequences are represented by different colors.