Rishi Das Roy1, Manju Bhardwaj2, Vasudha Bhatnagar3, Kausik Chakraborty4, Debasis Dash1.
Abstract
Eubacterial genomes vary considerably in their nucleotide composition. The percentage of genetic material constituted by guanosine and cytosine (GC) nucleotides ranges from 20% to 70%. It has been posited that GC-poor organisms are more dependent on protein folding machinery. Previous studies have ascribed this to the accumulation of mildly deleterious mutations in these organisms due to population bottlenecks. This phenomenon has been supported by protein folding simulations, which showed that proteins encoded by GC-poor organisms are more prone to aggregation than proteins encoded by GC-rich organisms. To test this proposition using a genome-wide approach, we classified different eubacterial proteomes in terms of their aggregation propensity and chaperone-dependence using multiple machine learning models. In contrast to the expected decrease in protein aggregation with an increase in GC richness, we found that the aggregation propensity of proteomes increases with GC content. A similar and even more significant correlation was obtained with the GroEL-dependence of proteomes: GC-poor proteomes have evolved to be less dependent on GroEL than GC-rich proteomes. We thus propose that a decrease in eubacterial GC content may have been selected in organisms facing proteostasis problems.
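The GC percentage referred to throughout can be computed directly from a nucleotide sequence. The snippet below is a minimal sketch, not the authors' pipeline: the function name `gc_content` and the two toy sequences are illustrative assumptions, and ambiguous bases such as N are simply ignored.

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of G and C among the unambiguous bases A, C, G, T."""
    seq = sequence.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * (counts["G"] + counts["C"]) / total


# Toy sequences (hypothetical, not taken from any genome in the study).
at_rich = "ATATTAGCATAATTTA"
gc_rich = "GCGCCGATGGCCGCGC"
print(f"AT-rich toy sequence: {gc_content(at_rich):.1f}% GC")
print(f"GC-rich toy sequence: {gc_content(gc_rich):.1f}% GC")
```

For a whole genome, the same function would be applied to the concatenated replicons or averaged over coding sequences, depending on which definition of genomic GC content is intended.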
Year: 2014 PMID: 25339987 PMCID: PMC4193397 DOI: 10.12688/f1000research.4307.1
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Figure 1. Integration of independent studies.
A Venn diagram of proteins of E. coli identified by different experimental studies shows that ~45% of soluble proteins reported by Niwa et al. overlap with GroEL/S or DnaK substrates (soluble proteins are defined as having solubility >70% and aggregation-prone proteins have solubility <30%).
Comparison of previous classifiers with our classifier.
| Method | Sensitivity | Specificity | Accuracy | AUC | MCC |
|---|---|---|---|---|---|
| SVM | | | 80 | | |
| J48 (decision tree algorithm) | | | 72 | 0.72 | |
| VTJ48 (visually tuned J48) | | | 76 | 0.81 | |
| Fang | 82.00 | 85.00 | 84 | 0.91 | 0.67 |
* Built on a curated training data set.
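The column headings of this comparison are standard binary-classification measures. As a reminder of how they relate to a confusion matrix, here is a minimal sketch; the function name `binary_metrics` and the counts are hypothetical and are not taken from any of the compared studies.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical balanced test set: 100 positives and 100 negatives.
metrics = binary_metrics(tp=82, fp=15, tn=85, fn=18)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, the Matthews correlation coefficient (MCC) stays informative when the two classes are of very different sizes, which is one reason it is often reported alongside AUC.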
Figure 2. Receiver operating characteristic (ROC) curves.
ROC curves of the soluble protein classifier (SolubEcoli.pgc) and the GroEL obligate protein classifier (GDP1.pgc). The areas under the curves (AUC) are given in the legend.
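The area under a ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes AUC from that pairwise definition. It is illustrative only: the function name `roc_auc`, the labels, and the scores are assumptions, not output of SolubEcoli.pgc or GDP1.pgc.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for three positives and three negatives.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(f"AUC = {roc_auc(labels, scores):.3f}")
```

The quadratic loop is fine for small examples; for proteome-scale data a rank-based formulation or a library routine would be the more practical choice.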
Selected features of proteins used to build the “SolubEcoli.pgc” classifier.
| Serial | Feature id† | Description | p-value* |
|---|---|---|---|
| 1 | SW_SOC2 | Quasi-sequence-order calculated from | 2.20E-16 |
| 2 | PPR | Distribution of positively charged amino | 2.20E-16 |
| 3 | H(8)M | Amino acid pair composition of histidine to | 2.33E-15 |
| 4 | M-B(Hydr)1 | Moreau-Broto auto correlation (lag 1) of | 2.24E-08 |
| 5 | PseAAC_T1_3 | Pseudo amino acid composition of aspartic | 9.45E-06 |
| 6 | PseAAC_T1_5 | Pseudo amino acid composition of | 6.87E-05 |
| 7 | FI_16_psavgl | Average length of folded segments of | 8.14E-05 |
| 8 | PseAAC_T1_4 | Pseudo amino acid composition of glutamic | 0.000542 |
| 9 | Dstrbu_Pol_2:3 | Distribution of amino acids according to | 0.001289 |
| 10 | M-B(mutblty)6 | Moreau-Broto auto correlation (lag 6) of | 3.65E-03 |
| 11 | T | Composition of amino acid Threonine | 5.00E-03 |
| 12 | Mrn(vlum)27 | Moran auto correlation (lag 27) of amino | 0.00926 |
| 13 | Mrn(Polar)22 | Moran auto correlation (lag 22) of amino | 0.013 |
| 14 | M-B(mutblty)9 | Moreau-Broto auto correlation (lag 9) of | 0.01988 |
| 15 | Geary(sterc)4 | Geary auto correlation (lag 4) of amino acid | 0.03536 |
| 16 | M-B(mutblty)24 | Moreau-Broto auto correlation (lag 24) of | 0.05416 |
| 17 | M-B(Hydr)12 | Moreau-Broto auto correlation (lag 12) of | 5.92E-02 |
| 18 | Mrn(RsdAcc)24 | Moran auto correlation (lag 24) of amino | 0.1077 |
| 19 | Mrn(Hydr)23 | Moran auto correlation (lag 23) of amino | 0.2106 |
| 20 | Geary(Free)13 | Geary auto correlation (lag 13) of amino | 0.3271 |
| 21 | Comp_Vol_2 | Composition of normalized van der Waals | 0.4631 |
| 22 | Geary(vlum)20 | Geary auto correlation (lag 20) of amino | 4.95E-01 |
| 23 | Geary(Free)14 | Geary auto correlation (lag 14) of amino | 0.499 |
| 24 | M-B(vlum)30 | Moreau-Broto auto correlation (lag 30) of | 0.9559 |
†Internal feature id of the Pro-Gyan application.
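Several of the listed features are sequence autocorrelation descriptors. To make the idea concrete, the sketch below computes a normalized Moreau-Broto autocorrelation at a given lag, that is, the average product of a standardized residue property at positions i and i + lag. It rests on assumptions: the Kyte-Doolittle hydropathy scale stands in for whatever property index Pro-Gyan uses internally, and the function name `moreau_broto` is hypothetical.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy values (assumed scale; Pro-Gyan's own index may differ).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Standardize the property over the 20 amino acids (zero mean, unit variance).
_values = list(KYTE_DOOLITTLE.values())
_mu, _sigma = mean(_values), pstdev(_values)
STANDARDIZED = {aa: (v - _mu) / _sigma for aa, v in KYTE_DOOLITTLE.items()}


def moreau_broto(sequence: str, lag: int, scale=STANDARDIZED) -> float:
    """Normalized Moreau-Broto autocorrelation of a protein sequence at `lag`."""
    values = [scale[aa] for aa in sequence.upper()]
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(sequence) - 1")
    pairs = n - lag
    return sum(values[i] * values[i + lag] for i in range(pairs)) / pairs


# Toy peptide alternating hydrophobic (I) and charged (K) residues.
peptide = "IKIKIKIK"
print(f"lag 1: {moreau_broto(peptide, 1):.3f}")
print(f"lag 2: {moreau_broto(peptide, 2):.3f}")
```

In the alternating toy peptide, neighbouring residues have opposite hydropathy, so the lag-1 value is negative and the lag-2 value is positive. Moran and Geary autocorrelations follow the same pairing scheme with different normalizations.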
Figure 3. Composition of basic amino acids over ~1100 eubacterial genomes.
The x-axis of each subplot shows the GC composition of each genome, whereas the y-axis shows the corresponding amino acid composition.
Figure 4. Aggregation-prone proteins are richer in GC-content than soluble proteins.
In E. coli, aggregation-prone proteins have a higher GC content than soluble proteins. The Mann-Whitney test p-value (*) is 1.3e-15.
Figure 5. GC content is associated with fAg.
(A) GC content of the genome correlates with the fraction of the proteome that is aggregation-prone (fAg) (analysis of 570 bacterial genomes using the classifier). Rank-based correlation is provided along with the p-value. The black line shows a linear regression model. (B) The relationship between GC content and fAg was obtained through a phylogenetically independent contrast method (570 bacteria). A positive correlation (0.4) was identified between GC content and fAg (p-value < 2.1e-16).
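A rank-based correlation of the kind reported in panel A can be obtained by ranking both variables and computing the Pearson correlation of the ranks (Spearman's rho). The sketch below illustrates this under stated assumptions: the caption does not name the rank statistic, so Spearman's rho is assumed here, and the GC and fAg values are invented for illustration rather than drawn from the 570 genomes.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical genomes: GC percentage and fraction of aggregation-prone proteins.
gc_percent = [28.0, 35.5, 41.2, 50.8, 62.3, 68.9]
f_ag = [0.21, 0.25, 0.24, 0.31, 0.36, 0.40]
rho = spearman(gc_percent, f_ag)
print(f"Spearman rho = {rho:.3f}")
```

A plain rank correlation treats genomes as independent observations. Panel B addresses the shared ancestry of related genomes with phylogenetically independent contrasts, which this sketch does not attempt.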
Figure 6. Decrease in GC content is associated with decrease in fC3.
(A) Correlation of GC content with the fraction of the proteome that is GroEL obligate (fC3) over 570 bacterial genomes. Members of the phylum Tenericutes with and without the groEL gene are coloured in blue and red, respectively. Rank-based correlation is provided along with the p-value. The black line shows a logarithmic regression model. (B) A positive correlation (0.7) was identified between the independent contrasts of GC content and fC3 with respect to the phylogenetic information of bacterial genomes (570 bacteria, p-value < 2.2e-16). (C) The organisms were classified based on the number of groEL genes present in the genome. fC3 exhibited a significant increase with an increase in the number of genome-encoded groEL copies. The p-values were calculated by the Mann-Whitney test using a two-sided hypothesis.
Evaluation of classifiers on five C3 homologous proteins of groEL-lacking Ureaplasma urealyticum.
The homologs were identified in U. urealyticum by NCBI BLAST at an E-value threshold of 1e45. The aggregation propensity and GroEL dependency of these proteins were then classified by SolubEcoli.pgc and GDP1.pgc, respectively.
| C3 homologous proteins | E value | Accession | Is aggregation prone? | Is GroEL dependent? |
|---|---|---|---|---|
| UuMetK | 2e-99 | YP_002284849.1 | Yes (0.739) | No (0.804) |
| UuDeoA | 2e-80 | WP_004026878.1 | Yes (0.586) | No (0.938) |
| UuCsdB | 4e-62 | D82890 | Yes (0.884) | Yes (0.665) |
| UuGatY | 8e-46 | H82870 | Yes (0.672) | No (0.973) |
| UuYcfH | 7e-41 | E82944 | Yes (0.518) | No (0.881) |
Figure 7. fAg and fC3 are correlated with GC content independently of species habitat.
An ANCOVA on 570 organisms showed that a symbiotic relationship has neither a significant effect on, nor a significant interaction with GC content in determining, the aggregation propensity or GroEL dependency of an organism's proteins.