Armando Fernandes, Susana Vinga.
Abstract
This article focuses on improving machine learning models that predict protein expression levels from their codon encoding. Support vector regression (SVR) and partial least squares (PLS) were used to create the models. SVR yields predictions that surpass those of PLS. The models' predictive ability can be improved by adding two input features, codon identification number and codon count, to the previously used codon bias and minimum free energy. Applying ensemble averaging to the SVR or PLS models improves the results further. The present work motivates testing different ensembles and features with the aim of improving prediction models whose correlation coefficients are still far from perfect. These results are relevant for optimizing codon usage and enhancing protein expression levels in synthetic biology problems.
Year: 2016 PMID: 26934190 PMCID: PMC4775025 DOI: 10.1371/journal.pone.0150369
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Illustration of the codon bias, codon identification number and codon count features.
Codon identification numbers.
| Codon | ID | Codon | ID | Codon | ID | Codon | ID | Codon | ID | Codon | ID | Codon | ID | Codon | ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aaa | 1 | aac | 2 | aag | 3 | aat | 4 | aca | 5 | acc | 6 | acg | 7 | act | 8 |
| aga | 9 | agc | 10 | agg | 11 | agt | 12 | ata | 13 | atc | 14 | atg | 15 | att | 16 |
| caa | 17 | cac | 18 | cag | 19 | cat | 20 | cca | 21 | ccc | 22 | ccg | 23 | cct | 24 |
| cga | 25 | cgc | 26 | cgg | 27 | cgt | 28 | cta | 29 | ctc | 30 | ctg | 31 | ctt | 32 |
| gaa | 33 | gac | 34 | gag | 35 | gat | 36 | gca | 37 | gcc | 38 | gcg | 39 | gct | 40 |
| gga | 41 | ggc | 42 | ggg | 43 | ggt | 44 | gta | 45 | gtc | 46 | gtg | 47 | gtt | 48 |
| taa | 49 | tac | 50 | tag | 51 | tat | 52 | tca | 53 | tcc | 54 | tcg | 55 | tct | 56 |
| tga | 57 | tgc | 58 | tgg | 59 | tgt | 60 | tta | 61 | ttc | 62 | ttg | 63 | ttt | 64 |
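The numbering in the table above follows alphabetical order over the bases a, c, g, t, so it can be regenerated rather than typed in. The sketch below does that and derives two per-sequence vectors from it. The function names, and the exact layout of the ID and Count features (one ID per codon position, one count per codon type), are illustrative assumptions and not taken from the authors' code.

```python
from collections import Counter
from itertools import product

# Codon -> identification number, alphabetical over a, c, g, t
# (aaa = 1 ... ttt = 64), matching the table above.
CODON_ID = {"".join(c): i for i, c in enumerate(product("acgt", repeat=3), start=1)}

def split_codons(seq):
    """Split a coding sequence into consecutive in-frame codons (lowercased)."""
    seq = seq.lower()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

def codon_ids(seq):
    """Assumed ID feature: the identification number of each codon, in order."""
    return [CODON_ID[c] for c in split_codons(seq)]

def codon_counts(seq):
    """Assumed Count feature: occurrences of each of the 64 codons, indexed by ID - 1."""
    counts = Counter(split_codons(seq))
    return [counts.get(codon, 0) for codon in CODON_ID]

print(codon_ids("ATGAAAAAATAA"))  # [15, 1, 1, 49]
```

The ID vector grows with sequence length while the count vector always has 64 entries, so a real pipeline would need to fix the number of codon positions it uses; that choice is not specified here.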
Fig 2. Schematics of the nested n-fold cross-validation, with repetition and ensemble averaging.
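As a rough illustration of the scheme named in the Fig 2 caption, the following sketch runs an outer loop that holds out test folds, an inner loop that selects a hyperparameter on the training part only, and several repetitions with different shuffles whose test predictions are averaged per sample. Everything concrete in it is an assumption: the fold counts, the number of repetitions, and the one-dimensional ridge regressor, which merely stands in for SVR or PLS.

```python
import random
from statistics import mean

def k_folds(indices, n_folds, rng):
    """Shuffle the indices and split them into n_folds roughly equal parts."""
    idx = list(indices)
    rng.shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def fit_ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge regression with intercept (stand-in for SVR/PLS)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / (sxx + lam)
    return lambda x: my + slope * (x - mx)

def select_lambda(xs, ys, train_idx, lams, n_folds, rng):
    """Inner loop: pick the hyperparameter with the lowest mean validation MSE."""
    folds = k_folds(train_idx, n_folds, rng)

    def cv_error(lam):
        errors = []
        for fold in folds:
            held = set(fold)
            tr = [i for i in train_idx if i not in held]
            model = fit_ridge_1d([xs[i] for i in tr], [ys[i] for i in tr], lam)
            errors.append(mean((model(xs[i]) - ys[i]) ** 2 for i in fold))
        return mean(errors)

    return min(lams, key=cv_error)

def nested_cv_ensemble(xs, ys, lams, n_outer=5, n_inner=4, n_rep=10, seed=0):
    """Return one test prediction per sample, averaged over n_rep repetitions."""
    rng = random.Random(seed)
    n = len(xs)
    preds = [[] for _ in range(n)]
    for _ in range(n_rep):
        for test_idx in k_folds(range(n), n_outer, rng):
            held = set(test_idx)
            train_idx = [i for i in range(n) if i not in held]
            lam = select_lambda(xs, ys, train_idx, lams, n_inner, rng)
            model = fit_ridge_1d([xs[i] for i in train_idx],
                                 [ys[i] for i in train_idx], lam)
            for i in test_idx:
                preds[i].append(model(xs[i]))  # sample i is never in its own training set
    return [mean(p) for p in preds]            # ensemble average across repetitions

# Synthetic data, for illustration only.
data_rng = random.Random(1)
xs = [i / 10 for i in range(40)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0, 0.3) for x in xs]
ensemble = nested_cv_ensemble(xs, ys, lams=[0.0, 1.0, 10.0])
```

Because each sample lands in exactly one outer test fold per repetition, every entry of `ensemble` is the mean of `n_rep` predictions made by models that never saw that sample during training or hyperparameter selection.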
Test R2 and RMSE for various models created using the Welch et al. and the Supec and Smuc datasets.
The feature abbreviations are: SelBias, selected codon bias; MFE, minimum free energy; Bias, codon bias; ID, codon identification; Count, codon count.
| Dataset | Model | Algorithm | Features | R2 | RMSE | Observations |
|---|---|---|---|---|---|---|
| Welch | 1 | PLS | SelBias | 0.439 | 0.570 | Reference |
| Welch | 2 | PLS | MFE Bias ID Count | 0.487 | 0.537 | - |
| Welch | 3 | SVR | SelBias | 0.427 | 0.566 | - |
| Welch | 4 | SVR | MFE Bias ID Count | 0.504 | 0.523 | - |
| Supec and Smuc | 5 | PLS | MFE Bias | 0.565 | 1.91e3 | - |
| Supec and Smuc | 6 | PLS | MFE Bias ID Count | 0.680 | 1.63e3 | - |
| Supec and Smuc | 7 | SVR | MFE Bias | 0.632 | 1.75e3 | Reference |
| Supec and Smuc | 8 | SVR | MFE Bias ID Count | 0.698 | 1.57e3 | - |
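For readers who want to reproduce this kind of table on their own predictions, the two metrics can be computed as sketched below. This excerpt does not say whether R2 denotes the coefficient of determination or the squared Pearson correlation, so both are shown; the function names are illustrative.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2_determination(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def r2_pearson(y_true, y_pred):
    """Squared Pearson correlation between measured and predicted values."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)
```

The two R2 definitions agree only when predictions are unbiased and correctly scaled; a constant offset in the predictions lowers the coefficient of determination but leaves the squared correlation unchanged. RMSE carries the units of the target, which is why its magnitude differs so much between the two datasets.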
Minimum and maximum values of the confidence intervals for differences in R2 at the 5% significance level.
The situations in bold and italic have the best R2. The feature abbreviations are: SelBias, selected codon bias; MFE, minimum free energy; Bias, codon bias; ID, codon identification; Count, codon count. Comp stands for comparison, M for model and Alg for algorithm.
| Dataset | Comp | M | Alg | Features | M | Method | Features | R2 diff. min (×10⁻²) | R2 diff. max (×10⁻²) | p-value |
|---|---|---|---|---|---|---|---|---|---|---|
| Welch | A | 1 | PLS | SelBias | | | | 3.66 | 5.92 | 2.10e-14 |
| Welch | B | 3 | SVR | SelBias | | | | 6.37 | 8.93 | 1.60e-24 |
| Welch | C | 3 | SVR | SelBias | | | | -2.55 | 0.184 | 0.0895 |
| Welch | D | 2 | PLS | MFE Bias ID Count | | | | 0.654 | 2.69 | 1.41e-3 |
| Welch | E | 1 | PLS | SelBias | | | | 5.22 | 7.70 | 4.03e-20 |
| Supec and Smuc | F | 5 | PLS | MFE Bias | | | | 10.9 | 12.1 | 3.08e-94 |
| Supec and Smuc | G | 7 | SVR | MFE Bias | | | | 5.93 | 7.22 | 1.97e-49 |
| Supec and Smuc | H | 5 | PLS | MFE Bias | | | | 6.05 | 7.28 | 1.58e-53 |
| Supec and Smuc | I | 6 | PLS | MFE Bias ID Count | | | | 1.14 | 2.37 | 7.72e-08 |
| Supec and Smuc | J | 7 | SVR | MFE Bias | | | | -5.41 | -4.23 | 1.67e-37 |
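One common way to obtain an interval and p-value of this kind is to pair the metric values of two models across validation repetitions and examine the mean of the paired differences. The sketch below does this with a normal approximation. The statistical procedure actually used for the table is not stated in this excerpt, so the paired design, the normal approximation, and the function name are all assumptions, and the input numbers are made up purely for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def paired_difference_ci(metric_a, metric_b, alpha=0.05):
    """Interval and two-sided p-value for mean(metric_b - metric_a).

    Assumes one metric value per repetition for each model, paired by
    repetition, and uses a normal approximation for the mean difference.
    """
    diffs = [b - a for a, b in zip(metric_a, metric_b)]
    centre = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))        # standard error of the mean difference
    z = NormalDist().inv_cdf(1 - alpha / 2)     # about 1.96 for alpha = 0.05
    p_value = 2 * (1 - NormalDist().cdf(abs(centre) / se))
    return centre - z * se, centre + z * se, p_value

# Made-up R2 values over five repetitions, for illustration only.
r2_model_a = [0.43, 0.45, 0.44, 0.42, 0.46]
r2_model_b = [0.48, 0.49, 0.50, 0.47, 0.50]
low, high, p = paired_difference_ci(r2_model_a, r2_model_b)
```

An interval that excludes zero corresponds to a p-value below `alpha`, which matches the pattern in the table: the only comparison whose interval spans zero (C) is also the only one with a p-value above 0.05. With few repetitions a Student t quantile would be more appropriate than the normal one used here.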
Minimum and maximum values of the confidence intervals for differences in RMSE at the 5% significance level.
The situations in bold and italic have the best RMSE. The feature abbreviations are: SelBias, selected codon bias; MFE, minimum free energy; Bias, codon bias; ID, codon identification; Count, codon count. Comp stands for comparison, M for model and Alg for algorithm.
| Dataset | Comp | M | Alg | Features | M | Method | Features | RMSE diff. min | RMSE diff. max | p-value |
|---|---|---|---|---|---|---|---|---|---|---|
| Welch | A | 1 | PLS | SelBias | | | | 2.52e-2 | 3.98e-2 | 1.45e-15 |
| Welch | B | 3 | SVR | SelBias | | | | 3.55e-2 | 5.04e-2 | 3.38e-23 |
| Welch | C | 3 | SVR | SelBias | | | | -4.33e-3 | 1.25e-2 | 0.339 |
| Welch | D | 2 | PLS | MFE Bias ID Count | | | | 8.44e-3 | 2.06e-2 | 4.88e-6 |
| Welch | E | 1 | PLS | SelBias | | | | 3.95e-2 | 5.46e-2 | 1.23e-25 |
| Supec and Smuc | F | 5 | PLS | MFE Bias | | | | 266 | 297 | 4.46e-90 |
| Supec and Smuc | G | 7 | SVR | MFE Bias | | | | 155 | 189 | 6.37e-49 |
| Supec and Smuc | H | 5 | PLS | MFE Bias | | | | 147 | 178 | 4.33e-51 |
| Supec and Smuc | I | 6 | PLS | MFE Bias ID Count | | | | 36.4 | 69.9 | 2.56e-09 |
| Supec and Smuc | J | 7 | SVR | MFE Bias | | | | -135 | -104 | 2.76e-34 |
Fig 3. Results for ensemble averaging of validation repetitions.
Fig 4. Absolute percentage error for the ensembles from Fig 3.
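The absolute percentage error named in the Fig 4 caption is conventionally defined per sample as 100·|y − ŷ| / |y|. A minimal sketch, assuming that standard definition and using an illustrative function name:

```python
def absolute_percentage_error(y_true, y_pred):
    """Per-sample absolute percentage error, 100 * |y - y_hat| / |y|.

    Undefined for samples whose true value is zero (raises ZeroDivisionError).
    """
    return [100.0 * abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]

print(absolute_percentage_error([100.0, 200.0], [110.0, 150.0]))  # [10.0, 25.0]
```

Unlike RMSE, this measure is relative to each sample's own expression level, so it is comparable across datasets whose targets differ by orders of magnitude.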