Tran Tuan-Anh, Le Thi Ly, Ngo Quoc Viet, Pham The Bao.
Abstract
BACKGROUND: Since recombinant proteins were first produced, they have become widely used in many areas of the life sciences. The global pharmaceutical market was worth $87 billion in 2008, and sales of industrial enzymes exceeded $4 billion in 2012, which is strong evidence of the potential of recombinant proteins. However, a native gene introduced into a host can be incompatible with the host's expression system in its codon usage bias, GC content, repeat regions, and Shine-Dalgarno sequence, so yields can fall significantly. Hence, we propose novel methods for gene optimization based on a neural network, Bayesian theory, and Euclidean distance.
RESULT: The correlation coefficients of our neural network are 0.86, 0.73, and 0.90 in the training, validation, and testing processes, respectively. In addition, genes optimized by our methods appear to be associated with highly expressed genes and give reasonable codon adaptation index values. Furthermore, genes optimized by the proposed methods closely match previous experimental data.
Keywords: Bayes’ theorem; Codon usage bias; Euclidean distance; Gene optimization; Highly expressed gene; Neural network
Year: 2017 PMID: 28187713 PMCID: PMC5303253 DOI: 10.1186/s12859-017-1517-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Collected data: redesigned genes and their respective products
| Host | Product | Number of genes | Reference |
|---|---|---|---|
| E. coli BL21 | DNA Polymerase and scFV | 62 | [ |
| E. coli BL21 | Cystatin C | 2 | [ |
| E. coli BL21 | PEDF | 2 | [ |
| E. coli W3110 | Prochymosin | 7 | [ |
Fig. 1 Properties of HEGP and DHEG. The top plots show the distribution of the CAI values of HEG. The bottom right plot shows the distribution of the GC values of HEG. The bottom center plots illustrate the HEG probability and the distance to HEG of randomly generated gene sequences with respect to their CAI and GC values
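
The CAI and GC values used as axes in these figures can be computed from a sequence alone, given a reference set of highly expressed genes. The following is a minimal sketch, not the authors' implementation: it assumes the standard definition of the codon adaptation index (the geometric mean of each codon's frequency relative to the most frequent synonymous codon in the reference set) and the standard genetic code. The function names are hypothetical, and codons never seen in the reference set are simply skipped, which is a simplification.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)


def codons(seq):
    """Split a coding sequence into complete codons (trailing bases dropped)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def gc_content(seq):
    """Fraction of G and C nucleotides in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def relative_adaptiveness(reference_genes):
    """Weight of each codon: its count divided by the count of the most
    frequent synonymous codon in the reference (highly expressed) genes."""
    counts = Counter(c for gene in reference_genes for c in codons(gene))
    best = defaultdict(int)
    for codon, n in counts.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best[aa], n)
    return {codon: n / best[CODON_TABLE[codon]] for codon, n in counts.items()}


def cai(seq, weights):
    """Geometric mean of codon weights. Met, Trp, and stop codons are
    excluded; codons absent from the reference set are skipped (assumption)."""
    logs = [
        math.log(weights[c])
        for c in codons(seq)
        if c in weights and CODON_TABLE[c] not in "MW*"
    ]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0
```

With a toy reference set such as `["GCTGCTGCTGCC", "AAAAAAAAG"]`, a gene built only from the most frequent codons (`"GCTAAA"`) scores a CAI of 1.0, and any gene using rarer synonymous codons scores lower.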
Fig. 2 Comparison between NN and linear regression
P-values from the Shapiro-Wilk normality test for the correlations of the NN and of linear regression
| Method | Training | Validation | Testing |
|---|---|---|---|
| NN | 0.03 | 0.00 | 9.10 × 10⁻⁵ |
| Linear regression | 0.21 | 0.11 | 0.13 |
Fig. 3 Visualization of the fitness function based on NN (log scale) with respect to CAI and GC values
Fig. 4 Comparison between gene optimization methods. The plots show the 2-dimensional distribution of the redesigned genes. The X and Y coordinates are the CAI and GC values, respectively
Descriptive statistics for optimized genes
Orange cells mark values that differ by more than 5% from the corresponding HEG values; green cells mark values that differ by 5% or less
P-values from the Shapiro-Wilk normality test for optimized genes
P-values from Wilcoxon signed-rank test for difference between HEG and optimized genes
Results of optimization for the gene coding for prochymosin and comparison with the experimental result from Menzella's study
| Method | CAI | GC | Patterns matching (6 nucleotides) | Patterns matching (7 nucleotides) | Patterns matching (8 nucleotides) | Patterns matching (9 nucleotides) | Patterns matching (total) |
|---|---|---|---|---|---|---|---|
| Menzella | 0.72 | 0.49 | | | | | |
| Jcat | 0.96 | 0.50 | 177 | 69 | 22 | 12 | 280 |
| Eugene | 0.94 | 0.50 | 48 | 20 | 6 | 2 | 76 |
| HEGP | 0.81 | 0.51 | 173 | 55 | 15 | 6 | 249 |
| DHEG | 0.73 | 0.50 | 216 | 61 | 13 | 1 | 291 |
| NN | 0.47 | 0.49 | 185 | 57 | 20 | 7 | 269 |
| NNP | 0.67 | 0.50 | 204 | 68 | 26 | 13 | 311 |
| NND | 0.64 | 0.50 | 198 | 63 | 17 | 4 | 282 |
| Linear | 0.52 | 0.52 | 185 | 64 | 13 | 2 | 264 |
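
The record does not say how the pattern-matching counts in the table above were obtained. The following is a hedged sketch of one plausible reading, not the authors' confirmed procedure: it assumes the optimized gene and the experimental gene have equal length (both encode the same protein), finds maximal runs of identical nucleotides at aligned positions, and tallies the runs whose length is exactly 6, 7, 8, or 9. The function names are hypothetical.

```python
from collections import Counter


def matching_run_lengths(reference, candidate):
    """Lengths of maximal runs of identical nucleotides at aligned positions."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have equal length")
    runs, current = [], 0
    for r, c in zip(reference.upper(), candidate.upper()):
        if r == c:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:  # close a run that reaches the end of the sequences
        runs.append(current)
    return runs


def count_runs_by_length(reference, candidate, lengths=(6, 7, 8, 9)):
    """Number of matching runs of each requested length, plus their total."""
    tally = Counter(matching_run_lengths(reference, candidate))
    counts = {k: tally[k] for k in lengths}
    counts["total"] = sum(counts.values())
    return counts
```

Under this reading the length categories are disjoint, which is consistent with the Total column being the sum of the four per-length columns. A sliding-window count of shared k-mers would be an equally reasonable alternative and would give different numbers.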
Fig. 5 Demonstration program and user's guide