Mohd Faheem Khan, Sanjukta Patra.
Abstract
Protein stability is affected at several hierarchical levels: gene, RNA, amino acid sequence and structure. The gene is the first level, contributing through varying codon composition. The codon selectivity of an organism differs between normal and extremophilic milieus. The present work details the codon usage patterns of six extremophilic classes and their harmony. Homologous gene datasets of thermophile-mesophile, psychrophile-mesophile, thermophile-psychrophile, acidophile-alkaliphile, halophile-nonhalophile and barophile-nonbarophile pairs were analysed to filter statistically significant attributes. Relative abundance analysis, ranking on a 1-9 scale, nucleotide composition analysis, attribute weighting and machine learning algorithms were employed to arrive at the findings. AGG in thermophiles and barophiles, CAA in mesophiles and psychrophiles, TGG in acidophiles, GAG in alkaliphiles and GAC in halophiles had the highest preference. A preference for GC-rich and G/C-ending codons was observed in halophiles and barophiles, whereas a decreasing trend was seen in psychrophiles and alkaliphiles. In thermophiles, GC-rich codons decreased and G/C-ending codons increased, whereas acidophiles showed equal contents of GC-rich and G/C-ending codons. Codon usage patterns exhibited harmony among different extremophiles, which has been detailed. However, the codon attribute preferences of extremophiles and their selectivity varied in comparison to non-extremophiles. The findings can be instrumental in codon optimization for heterologous expression of extremophilic proteins.
Year: 2018 PMID: 30341344 PMCID: PMC6195531 DOI: 10.1038/s41598-018-33476-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Gene coding sequences (CDS) collected for homologous extremophilic and non-extremophilic proteins, with the statistically significant codon features identified by the KS test (p < 0.05).
| Compared datasets | Number of genes (CDS) | Number of source organisms from which the CDS were collected* | Data collection criteria and homology search method used | Statistically significant codons enumerated by KS test (p < 0.05) |
|---|---|---|---|---|
| T-M | 116 pairs | 37 thermophiles and 51 mesophiles | BLAST (>70% homology) and CLUSS 2 (alignment-free algorithm) | 33 (ATT, ATA, CTT, CTC, CTA, CTG, TTA, TTG, GTT, TGT, GCT, GCA, GGT, GGC, CCT, CCA, ACT, ACC, TCT, TCA, AGT, TAT, CAA, CAG, AAT, CAT, GAA, GAT, CGT, CGC, CGA, AGA, AGG) |
| P-M | 110 pairs | 27 psychrophiles and 50 mesophiles | CLUSS 2 (alignment-free algorithm) | 26 (AAG, AAT, AGA, AGG, AGT, ATA, ATG, CAA, CAG, CAT, CGT, CTC, CTG, GAC, GAT, GCA, GCG, GCT, GGA, GGT, GTA, TCC, TTA, TTC, TTG, TTT) |
| T-P | 110 pairs | 36 thermophiles and 27 psychrophiles | CLUSS 2 (alignment-free algorithm) | 44 (AAG, AAT, ACC, ACT, AGA, AGG, AGT, ATA, ATG, ATT, CAA, CAT, CCC, CCT, CGA, CGC, CGT, CTA, CTC, CTG, CTT, GAA, GAC, GAT, GCA, GCT, GGA, GGC, GGT, GTA, GTC, GTG, TAA, TAC, TAT, TCA, TCC, TCT, TGA, TGT, TTA, TTC, TTG, TTT) |
| A-B | 112 pairs | 73 acidophiles and 85 alkaliphiles | CDS of those proteins having extreme optimum pH were collected (Acid stable, pH ≤ 6 and Alkaline stable, pH ≥ 8); CLUSS 2 (alignment-free algorithm) | 49 (TTT, TTC, TTG, CTT, CTC, CTG, ATA, ATC, ATG, GTT, TGA, TCC, TCA, TCG, CCT, CCC, CCA, ACT, ACC, ACA, ACG, GCT, GCG, TAT, TAC, TAA, CAA, CAC, CAG, AAA, AAT, AAC, AAG, GAT, GAA, GAC, GAG, TGG, CGA, CGC, CGG, CTA, AGT, AGC, AGA, GGT, GGC, GGA, GTG) |
| H-Nh | 100 pairs | 19 halophiles and 12 non-halophiles | CLUSS 2 (alignment-free algorithm) | 40 (TTT, TTA, TTG, CTT, CTG, ATT, ATC, ATA, GTT, GTC, GTA, GTG, TCT, TCA, TCG, CCT, CCC, CCA, ACT, ACC, ACA, ACG, GCT, GCC, GCA, TAT, CAT, CAG, AAT, AAC, AAA, GAT, GAC, GAA, GAG, CGA, CGG, AGA, AGG, GGT) |
| B-Nb | 40 pairs | 6 barophiles and 5 non-barophiles | CLUSS 2 (alignment-free algorithm) | 23 (TTT, TTC, TTA, ATT, ATA, GTC, GTA, ACT, ACA, ACG, GCA, GCG, TAC, CAA, AAT, AAA, AAG, GAA, GAG, AGT, AGA, AGG, GGG) |
*Not all source organisms are extremophiles; proteins with extremophilic physicochemical behaviour were also included and their CDS were collected.
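As an illustration of how such a per-codon screen could be carried out, the following minimal Python sketch computes the percentage usage of each codon in a CDS and compares two samples of percentages with a two-sample KS test. The function names `codon_percentages` and `ks_two_sample` are hypothetical, and the p-value comes from the asymptotic Kolmogorov series, which is only approximate for small samples. The source does not state which software or KS variant was used, so this is a sketch under those assumptions and not the authors' pipeline.

```python
from collections import Counter
from math import exp, sqrt


def codon_percentages(cds):
    """Return {codon: percent of all codons} for one in-frame CDS."""
    cds = cds.upper().replace("U", "T")
    # Drop any trailing incomplete codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: 100.0 * n / total for codon, n in counts.items()}


def ks_two_sample(xs, ys):
    """Two-sample KS statistic D and an asymptotic (approximate) p-value."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    d = 0.0
    for v in sorted(set(xs + ys)):
        # Empirical CDFs of both samples evaluated at v.
        fx = sum(1 for x in xs if x <= v) / n
        fy = sum(1 for y in ys if y <= v) / m
        d = max(d, abs(fx - fy))
    lam = sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The series below converges slowly here; the p-value is ~1.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)
```

Under this sketch, a codon would be retained as significant when the p-value returned for its two samples of percentages (one value per CDS in each class) falls below 0.05, the threshold given in the table caption.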
Figure 1. Relative abundance of statistically significant codons in the compared datasets: (A) T-M dataset, (B) P-M dataset, (C) T-P dataset, (D) A-B dataset, (E) H-Nh dataset and (F) B-Nb dataset. Green bars represent positive contributors of the main dataset and negative contributors of the counter dataset, whereas dark blue bars represent positive contributors of the counter dataset and negative contributors of the main dataset.
Figure 2. Ranking of statistically significant codons on a scale of 1–9, generated with a Python script: (A) T-M dataset, (B) P-M dataset, (C) T-P dataset, (D) A-B dataset, (E) H-Nh dataset and (F) B-Nb dataset.
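The script used for the 1–9 ranking is not reproduced in this record. Purely as an assumption about how a score per codon might be placed on such a scale, the sketch below applies a linear min-max rescaling followed by rounding. The function name `rank_1_to_9` and the choice of linear rescaling are hypothetical and may differ from the authors' script.

```python
def rank_1_to_9(scores):
    """Map {codon: score} onto integer ranks 1..9 by linear rescaling.

    Assumption: the lowest score maps to 1 and the highest to 9.
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: place every codon at the scale midpoint.
        return {codon: 5 for codon in scores}
    return {codon: 1 + round(8 * (s - lo) / (hi - lo))
            for codon, s in scores.items()}
```

With this mapping, the codon with the highest score in a dataset receives rank 9 and the one with the lowest score receives rank 1.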
Figure 3. Nucleotide composition analysis of the preferred significant codons for six types of extremophiles by two parameters: (A) % AT- or % GC-richness and (B) % A/T- or % G/C-ending at the third (wobble) position.
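The two parameters in this caption can be sketched for any list of codons as follows. The function name `gc_profile` is hypothetical, and the definition of a GC-rich codon as one with at least two G or C bases is an assumption, since the record does not give the exact criterion.

```python
def gc_profile(codons):
    """Return (% GC-rich codons, % G/C-ending codons) for a codon list.

    Assumption: a codon counts as GC-rich when at least two of its
    three bases are G or C.
    """
    codons = [c.upper() for c in codons]
    if not codons:
        raise ValueError("codon list is empty")
    gc_rich = sum(1 for c in codons if sum(b in "GC" for b in c) >= 2)
    gc_ending = sum(1 for c in codons if c[2] in "GC")
    n = len(codons)
    return 100.0 * gc_rich / n, 100.0 * gc_ending / n


# The five top-preference codons named in the abstract, pooled here
# only to show the calculation.
print(gc_profile(["AGG", "CAA", "TGG", "GAG", "GAC"]))  # (80.0, 80.0)
```

The AT-rich and A/T-ending percentages are the complements of the two returned values.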
Figure 4. Data-point analysis of the most and least preferred codons w.r.t. extremophiles: (A) AGG codon (most preferred w.r.t. thermophiles) of the T-M dataset, (B) CAA codon (least preferred w.r.t. thermophiles) of the T-M dataset, (C) CAA codon (most preferred w.r.t. psychrophiles) of the P-M dataset, (D) AGG codon (least preferred w.r.t. psychrophiles) of the P-M dataset, (E) AGG codon (most preferred w.r.t. thermophiles) of the T-P dataset, (F) CAA codon (least preferred w.r.t. thermophiles) of the T-P dataset, (G) TGG codon (most preferred w.r.t. acidophiles) of the A-B dataset, (H) GAG codon (least preferred w.r.t. acidophiles) of the A-B dataset, (I) GAC codon (most preferred w.r.t. halophiles) of the H-Nh dataset, (J) AGG codon (least preferred w.r.t. halophiles) of the H-Nh dataset, (K) AGG codon (most preferred w.r.t. barophiles) of the B-Nb dataset and (L) CAA codon (least preferred w.r.t. barophiles) of the B-Nb dataset. Green data-points represent the highest ranked codons with respect to either extremophiles or non-extremophiles, whereas dark blue data-points represent the lowest ranked codons with respect to either extremophiles or non-extremophiles.
Figure 5. Positive contribution of codon features related to codon harmony in extremophiles. The different types of extremophiles are colour coded. The figure was deduced from the relative abundance and codon ranking analyses applied to the datasets used in the present study.
Summary of results obtained by applying 11 attribute weighting algorithms to the different datasets.
| T-M | | P-M | | T-P | | A-B | | H-Nh | | B-Nb | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Codon features | Algorithms weighted above 0.5 | Codon features | Algorithms weighted above 0.5 | Codon features | Algorithms weighted above 0.5 | Codon features | Algorithms weighted above 0.5 | Codon features | Algorithms weighted above 0.5 | Codon features | Algorithms weighted above 0.5 |
| CAA | 10 | AGA | 8 | CAA | 10 | TGG | 9 | GAC | 11 | AGG | 10 |
| TAT | 9 | AAG | 7 | AGA | 9 | AAC | 8 | GTC | 9 | AAG | 10 |
| CGT | 9 | TTA | 7 | CGT | 9 | TAC | 8 | TTT | 8 | AAA | 8 |
| TCT | 7 | GGA | 7 | AGG | 9 | GCT | 4 | AAA | 6 | AGT | 7 |
| AAT | 6 | AGG | 7 | GGA | 8 | TAT | 4 | TTA | 6 | GAG | 5 |
| ATA | 5 | GCG | 5 | ATA | 7 | GAT | 4 | ACG | 6 | ATA | 4 |
| ACC | 5 | GAC | 4 | AAG | 7 | GGT | 4 | AGG | 6 | GAA | 4 |
| GCT | 5 | GGT | 4 | TTA | 6 | GGC | 4 | GAG | 5 | CAA | 4 |
| CAG | 5 | CAA | 4 | GGT | 5 | TTC | 3 | AAT | 5 | TAC | 4 |
| CTG | 5 | TCC | 4 | AAT | 4 | CTG | 3 | ATA | 4 | TTC | 3 |
| GCA | 4 | GAT | 3 | GAA | 4 | GTT | 3 | ATT | 4 | ACG | 3 |
| TGT | 4 | ATG | 2 | CTG | 4 | TCA | 3 | CTG | 4 | ATT | 3 |
| TCA | 4 | CTG | 2 | TAC | 4 | CCA | 3 | AGA | 3 | TTA | 2 |
| AGT | 4 | CGT | 2 | GAT | 3 | ACT | 3 | GAT | 3 | TTT | 2 |
| CGA | 4 | AGT | 1 | CTC | 3 | GCC | 3 | GAA | 3 | GCA | 1 |
| ACT | 4 | CAT | 1 | CAT | 3 | CAA | 3 | GCC | 3 | GTC | 1 |
| TTG | 4 | TTT | 1 | TTT | 2 | CAG | 3 | GTT | 3 | GGG | 1 |
| AGA | 4 | AAT | 1 | GCT | 2 | AAT | 3 | GTA | 2 | | |
| GGT | 4 | GCA | 1 | GAC | 2 | TTT | 2 | TAT | 2 | | |
| CCT | 3 | CAG | 1 | CCC | 2 | TTG | 2 | TTG | 2 | | |
| GGT | 3 | CTC | 1 | GGC | 2 | CTT | 2 | ACA | 2 | | |
| CAT | 3 | | | ATT | 2 | CTC | 2 | ACC | 2 | | |
| ATT | 3 | | | GTG | 2 | ATT | 2 | GTG | 2 | | |
| CTT | 3 | | | ACC | 2 | ATC | 2 | GCA | 2 | | |
| GTT | 3 | | | AGT | 1 | ATG | 2 | CCT | 2 | | |
| CCA | 3 | | | ACT | 1 | GTC | 2 | CTT | 2 | | |
| GAT | 3 | | | TTC | 1 | TCC | 2 | AAC | 2 | | |
| CTC | 2 | | | ATG | 1 | CCG | 2 | CCC | 1 | | |
| GGC | 2 | | | TCC | 1 | ACC | 2 | ATC | 1 | | |
| CGC | 2 | | | GCA | 1 | ACA | 2 | CAT | 1 | | |
| AGG | 2 | | | TAT | 1 | ACG | 2 | TCT | 1 | | |
| GAA | 2 | | | CGC | 1 | GCA | 2 | CAG | 1 | | |
| CTA | 1 | | | | | AAG | 2 | CCA | 1 | | |
| TTA | 1 | | | | | GAC | 2 | TCG | 1 | | |
| | | | | | | AGT | 2 | CGA | 1 | | |
| | | | | | | AGC | 2 | CGG | 1 | | |
| | | | | | | AGA | 2 | GGT | 1 | | |
| | | | | | | TCT | 1 | | | | |
| | | | | | | TCG | 1 | | | | |
| | | | | | | CCT | 1 | | | | |
| | | | | | | CCC | 1 | | | | |
| | | | | | | CAT | 1 | | | | |
| | | | | | | CGA | 1 | | | | |
| | | | | | | CGG | 1 | | | | |
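The counts in this table are tallies of how many attribute weighting algorithms assigned a codon a weight above 0.5. A minimal sketch of such a tally is given below. The function name `count_high_weights`, the algorithm labels and the toy weights are all hypothetical and are not values from the study.

```python
from collections import Counter


def count_high_weights(weights_by_algorithm, threshold=0.5):
    """Count, per codon, the algorithms that weight it above threshold.

    weights_by_algorithm maps an algorithm label to {codon: weight},
    with weights assumed to be normalised to the range 0..1.
    """
    tally = Counter()
    for weights in weights_by_algorithm.values():
        for codon, w in weights.items():
            if w > threshold:
                tally[codon] += 1
    # Highest count first; ties broken alphabetically by codon.
    return sorted(tally.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy input with made-up labels and weights, for illustration only.
toy = {
    "algo_a": {"CAA": 0.9, "TAT": 0.6, "CTA": 0.1},
    "algo_b": {"CAA": 0.8, "TAT": 0.4, "CTA": 0.2},
    "algo_c": {"CAA": 1.0, "TAT": 0.7, "CTA": 0.6},
}
print(count_high_weights(toy))  # [('CAA', 3), ('TAT', 2), ('CTA', 1)]
```

Codons that no algorithm weights above the threshold do not appear in the result.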
Prediction accuracy of supervised learning for classification and model generation for various extremophiles on the basis of codon usage.
| Model | Criterion used and percentage accuracy of prediction (%) | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | T-M | | P-M | | T-P | | A-B | | H-Nh | | B-Nb | |
| Lazy modeling | 82.86 | Naïve Bayes | 76.47 | 92.65 | 71.15 | 91.67 | 96.55 | |||||
| Logistic regression | Anova kernel type | 78.08 | Anova kernel type | 75.00 | Dot kernel type | 92.65 | Anova kernel type | 78.08 | Anova kernel type | 83.33 | Anova kernel type | 86.21 |
| SVM | Anova kernel type | 87.61 | libSVM (C-SVC and nu-SVC type) | 80.88 | Dot kernel type | 91.81 | Dot kernel type | 81.23 | libSVM (C-SVC and nu-SVC type) | 90.00 | libSVM (C-SVC and nu-SVC type) | 96.55 |
| ANN | 2 hidden layers with 20 neurons in each layer | 87.61 | 2 hidden layers with 40 neurons in each layer | 80.88 | 1 hidden layer with 10 neurons | 92.65 | 3 hidden layers with 30 neurons in each layer | 78.85 | 2 hidden layers with 30 neurons in each layer | 91.67 | 2 hidden layers with 20 neurons in each layer | 89.66 |
| Decision Tree/ Random Forest | Information Gain | 78.57 | Information Gain | 75.00 | Information Gain | 92.65 | Gini Index | 80.77 | Gain Ratio | 85.00 | Gini Index | 96.55 |
Summary of decision tree prediction on extremophile datasets with their criteria chosen and best discriminatory rule for classification of codons.
| Dataset | Tree induction method | Criterion (algorithm) chosen | Number of models generated | Best possible discriminatory rule | Accuracy of prediction (%) |
|---|---|---|---|---|---|
| T-M | Decision Tree | Information Gain | 1 | If % CAA (≤1.866] and % ATA (>1.866] and % CGC (>1.866] and % CTT (>2.823] → | 78.57 |
| P-M | Random Forest | Information Gain | 500 internal trees | If % CAA (>4.092] → | 75.00 |
| T-P | Random Forest | Information Gain | 100 internal trees | If % CAA (≤1.056] and % CGT (≤1.029] → | 92.65 |
| A-B | Random Forest | Gini Index | 500 internal trees | If % GAG (≤4.202] and % CTC (>2.705] and % GAT (≤5.524] → | 80.77 |
| H-Nh | Decision Tree | Gain Ratio | 1 | If % GAC (>8.861] → | 85.00 |
| B-Nb | Random Forest | Gini Index | 500 internal trees | If % AGG (>3.007] and % ATA (>3.553] → | 96.55 |
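To show how a rule of this form could be evaluated against the codon percentages of a single CDS, the sketch below encodes two of the rules as lists of threshold conditions. The names `rule_matches`, `B_NB_RULE` and `H_NH_RULE` are hypothetical. Reading each bracketed interval as a simple `>` or `<=` comparison is an assumption, and the function only reports whether the conditions hold because the class on the right-hand side of each rule is not given in this record.

```python
import operator

# Comparison operators that appear in the rules.
OPS = {">": operator.gt, "<=": operator.le}

# Conditions copied from the table: (codon, comparison, threshold in %).
B_NB_RULE = [("AGG", ">", 3.007), ("ATA", ">", 3.553)]
H_NH_RULE = [("GAC", ">", 8.861)]


def rule_matches(codon_percent, rule):
    """Return True when every condition of the rule holds.

    codon_percent maps codon -> percent usage in one CDS; a codon
    absent from the mapping is treated as 0 % (an assumption).
    """
    return all(OPS[op](codon_percent.get(codon, 0.0), threshold)
               for codon, op, threshold in rule)


# Hypothetical codon percentages, for illustration only.
print(rule_matches({"AGG": 3.5, "ATA": 4.0}, B_NB_RULE))  # True
print(rule_matches({"AGG": 3.5, "ATA": 1.0}, B_NB_RULE))  # False
```

Keeping each rule as data makes it straightforward to encode the remaining rows of the table in the same way.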