Ádám Györkei, Lejla Daruka, Dávid Balogh, Erika Őszi, Zoltán Magyar, Balázs Szappanos, Gergely Fekete, Mónika Fuxreiter, Péter Horváth, Csaba Pál, Bálint Kintses, Balázs Papp.
Abstract
Proteins are prone to aggregate when expressed above their solubility limits. Aggregation may occur rapidly, potentially as soon as proteins emerge from the ribosome, or slowly, following synthesis. However, in vivo data on aggregation rates are scarce. Here, we classified the Escherichia coli proteome into rapidly and slowly aggregating proteins using an in vivo image-based screen coupled with machine learning. We find that the majority (70%) of cytosolic proteins that become insoluble upon overexpression have relatively low rates of aggregation and are unlikely to aggregate co-translationally. Remarkably, such proteins exhibit higher folding rates compared to rapidly aggregating proteins, potentially implying that they aggregate after reaching their folded states. Furthermore, we find that a substantial fraction (~35%) of the proteome remains soluble at concentrations much higher than those found naturally, indicating a large margin of safety to tolerate gene expression changes. We show that high disorder content and low surface stickiness are major determinants of high solubility and are favored in abundant bacterial proteins. Overall, our study provides a global view of aggregation rates and hence solubility limits of proteins in a bacterial cell.
Year: 2022 PMID: 35449391 PMCID: PMC9023497 DOI: 10.1038/s41598-022-10427-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Distinguishing between rapidly and slowly aggregating proteins using GFP fusion. A C-terminally fused GFP tag reports on the relative time of aggregation across proteins. Nascent polypeptides with very high aggregation rates form anomalous intermolecular interactions before the GFP chromophore is committed to form [17] and therefore the fused GFP tag would show no fluorescence, yielding dark aggregates (lower pathway, black ball). Note that this route of aggregation likely occurs before the protein is folded, possibly as early as translation. Proteins that aggregate after the GFP chromophore is committed to form would result in fluorescent aggregates (upper pathway) (de Groot & Ventura, 2006). These proteins aggregate relatively slowly, after being fully synthesized (i.e. post-translationally).
Figure 2. Experimental workflow and distribution of protein aggregation phenotypes. (A) Workflow of high-throughput protein solubility measurement and classification. (B) Distribution of cellular fluorescence phenotypes based on 1,000 classified cells for each overexpressed cytoplasmic protein (represented by dots). The location of each dot is calculated from the fraction of cells showing each phenotype, with red lines representing the decision boundaries between the assigned aggregation categories. The majority of proteins are located close to the vertices, demonstrating that cells typically show homogeneous aggregation behavior. (C) Comparison of the in vivo and in vitro solubility phenotypes of proteins. In vitro aggregating proteins show a strong overlap with proteins forming either dark or fluorescent aggregates in vivo (odds ratio = 14.5, P < 10^-10, Fisher's exact test). (D) Frequency of proteins according to their aggregation phenotypes.
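The overlap statistic in panel (C) is an odds ratio with a Fisher's exact test on a 2x2 table. The following is a minimal sketch of that calculation using only the Python standard library. The function names and the counts are hypothetical and chosen for illustration; they are not taken from the study, and the record does not state whether a one- or two-sided test was used, so the two-sided variant shown here is an assumption.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Sample odds ratio of the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts (assumption, not study data):
# rows = aggregates in vivo (yes / no), columns = aggregates in vitro (yes / no)
table = (40, 10, 15, 35)
print("odds ratio:", odds_ratio(*table))
print("p-value:", fisher_exact_two_sided(*table))
```

An odds ratio well above 1 together with a small p-value would indicate that proteins aggregating in vitro are over-represented among those aggregating in vivo.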
Figure 3. Protein features distinguishing between slowly and rapidly aggregating proteins. Features discriminating between proteins that form dark (rapid) versus fluorescent (slow) aggregates. The predictive ability of each feature was measured as the average area under the receiver operating characteristic (ROC) curve in a tenfold cross-validation procedure based on logistic regression analyses. All displayed protein features are statistically significantly predictive after adjustment for multiple testing using the false discovery rate method; ** corresponds to p_adj < 0.01, * to p_adj < 0.05 (logistic regression). Error bars show the 95% confidence interval for the AUC value of each feature. Note that additional features with weaker discriminating ability are listed in Suppl. Table S5.
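To make the AUC metric concrete, here is a minimal sketch of a rank-based AUC for a single feature: the probability that a randomly chosen protein from one class has a higher feature value than a randomly chosen protein from the other class. This is a simplification and not the authors' pipeline; the logistic regression fit and the tenfold cross-validation are deliberately omitted, and the function name and feature values are hypothetical.

```python
def auc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative.

    Ties contribute 1/2, which matches the area under the ROC curve.
    """
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical feature values (assumption, not study data),
# e.g. number of residues in predicted aggregation hotspots
dark = [12, 15, 9, 20, 14]
fluorescent = [8, 11, 9, 6, 13]
print(auc(dark, fluorescent))
```

A value of 0.5 corresponds to no discriminating ability, and a value of 1.0 to perfect separation of the two classes by that feature.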
Figure 4. Key molecular features associated with in vivo aggregation rate. (A) Proteins with more residues in aggregation hotspots, as estimated by AggreScan [26], are more likely to form dark (i.e. rapid) than fluorescent (i.e. slow) aggregates (P = 5.01 × 10^-7, logistic regression). (B) Native mRNA expression levels of proteins in fluorescent aggregates are significantly higher than those of proteins in dark aggregates (P = 0.0008, Wilcoxon rank-sum test). (C,D) Effects of protein contact order and folding rates on the class of aggregation. Note that a lower contact order and a higher folding rate (FOLD-RATE score) indicate easier folding. Dark aggregates are associated with a lower folding ability. (E) Proteins in fluorescent aggregates are enriched in DnaK chaperone clients compared to those in dark aggregates (P = 4.94 × 10^-5, odds ratio = 2.43, Fisher's exact test). Whiskers show standard errors and were calculated by bootstrap resampling.
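The whiskers are described as bootstrap standard errors. A minimal sketch of that idea follows: resample the data with replacement many times, recompute the statistic each time, and take the standard deviation of the resampled estimates. The function name, the number of resamples, the seed, and the data are all assumptions for illustration; the record does not give the authors' settings.

```python
import random
from statistics import mean, stdev

def bootstrap_se(values, stat=mean, n_boot=1000, seed=0):
    """Bootstrap standard error of `stat` over `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = []
    for _ in range(n_boot):
        # Draw n observations with replacement from the original sample
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat(resample))
    # The spread of the resampled statistics estimates the standard error
    return stdev(estimates)

# Hypothetical indicator data (assumption, not study data):
# 1 = protein is a chaperone client, 0 = not a client
clients = [1] * 50 + [0] * 50
print(bootstrap_se(clients))
```

For a proportion of 0.5 in 100 observations, the result should be close to the analytical standard error of 0.05.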
Figure 5. Protein stickiness and disorder content shape solubility. (A) We use the surface stickiness score to measure promiscuous interaction propensity [38]. Both fluorescent and dark aggregates show higher surface stickiness than soluble proteins (P < 10^-10 and P < 10^-10, respectively, Wilcoxon rank-sum test). (B) Fractions of aggregating proteins as a function of disorder content (binned data), calculated using PONDR VSL2B. Upper panel shows the fraction of proteins in fluorescent aggregates among those that are either soluble or in fluorescent aggregates, while the lower panel shows the fraction of proteins in dark aggregates among those that are either soluble or in dark aggregates. (C) Disorder content as a function of native protein abundance in E. coli based on [41]. The most abundant 20% of E. coli proteins have a significantly higher disorder content than those in the least abundant 20% bin (P = 4.18 × 10^-11, Wilcoxon rank-sum test). Disorder content was calculated using PONDR VSL2B, but similar results are obtained with other predictors (see Tables S6 and S10).