Electrostatic mis-interactions cause overexpression toxicity of proteins in E. coli.

Gajinder Pal Singh, Debasis Dash.

Abstract

A majority of E. coli proteins inhibit its growth when overexpressed, but the reasons behind this overexpression toxicity remain unknown. Understanding the mechanism of overexpression toxicity is important from evolutionary, biotechnological and possibly clinical perspectives. Here we study sequence and functional features of cytosolic proteins of E. coli associated with overexpression toxicity to understand its mechanism. We find that the number of positively charged residues is significantly higher in proteins showing overexpression toxicity. Very long proteins also show high overexpression toxicity. Among the functional classes, transcription factors and regulatory proteins are enriched in toxic proteins, while catalytic proteins are depleted. Overexpression toxicity could be predicted with reasonable accuracy using these few properties. The importance of charged residues in overexpression toxicity indicates that nonspecific electrostatic interactions resulting from protein overexpression cause toxicity of these proteins and suggests ways to improve the expression level of native and foreign proteins in E. coli for basic research and biotechnology. These results might also be applicable to other bacterial species.

Year:  2013        PMID: 23734225      PMCID: PMC3667126          DOI: 10.1371/journal.pone.0064893

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Expression levels of proteins can be highly optimized in bacterial cells to maximize fitness [1], but in the laboratory it is often desirable to raise the expression level of a protein beyond its normal cellular level, which frequently leads to growth inhibition [2]. Protein overexpression in the model organism Escherichia coli is used in biophysical, biochemical and structural studies of proteins, in the production of industrially important enzymes [3], and in the development of strains for producing metabolites [4] and biofuels [5] and for bioremediation [6]. Furthermore, gene duplication, and hence protein overexpression, is also important from evolutionary and clinical perspectives, where it can lead to novel phenotypes including antibiotic resistance [7], [8]. Hence it is important to understand the mechanism of overexpression toxicity of proteins in E. coli.

In yeast, an overexpression library of endogenous proteins has been described [9]. Overexpression reduces the growth rate for a subset (∼15%) of proteins, which are highly enriched in structural disorder [10]. Disordered regions and proteins in eukaryotes are widely associated with protein-protein and protein-DNA interactions [11]–[13], so their increased levels may lead to a large number of promiscuous interactions [10] and thus toxicity. Disorder was also found to be associated with overexpression toxicity in other eukaryotes, Drosophila melanogaster and C. elegans, and with dosage-sensitive oncogenes in mice and humans [10]. In addition to disordered proteins, highly expressed proteins and members of protein complexes are highly sensitive to fold increases from their normal levels [14]. E. coli, like most bacteria, has few disordered regions and proteins; thus promiscuous interactions mediated by disordered regions could not be the major mechanism of overexpression toxicity in bacteria.

An overexpression library called the ASKA library (A Complete Set of E. coli K-12 ORF Archive) has been described for E. coli, in which most ORFs have been individually cloned with histidines and seven spacer amino acids at the N-terminal end, and five spacer amino acids and GFP (Green Fluorescent Protein) at the C-terminal end, in an IPTG-inducible expression vector [2]. The effects of IPTG induction on growth and GFP fluorescence were examined for each clone and classified into three categories each (“almost no growth”, “slow growth”, “normal growth” and “high fluorescence”, “fluorescence”, “no fluorescence” respectively). Under these conditions, a majority of proteins inhibit the growth of E. coli when overexpressed, while overexpression of a subset of proteins leads to severe toxicity. In particular, membrane proteins are highly toxic on overexpression [2]. Here we study sequence and functional properties of cytoplasmic proteins of E. coli that are highly toxic on overexpression to understand the mechanism of this toxicity, and find that the number of positively charged residues is the most important feature of toxic proteins. Functional classes also show differential enrichment: transcription factors and regulatory proteins were overrepresented, while catalytic proteins were underrepresented in toxic proteins.

Results

Protein overexpression upon IPTG induction (37 °C, LB) of the ASKA library leads to growth inhibition in about 79% of clones (52% “almost no growth” + 27% “slow growth”; Figure 1). In the “almost no growth” class a small fraction of clones do show GFP fluorescence (Figure 1), indicating some growth. Since we were interested in proteins whose overexpression is most toxic to E. coli, such that even a small degree of overexpression is likely to cause growth inhibition, we defined “toxic” proteins as those classified as both “almost no growth” and “no fluorescence”. Overall 40% (1589/3956) of the clones fall into this category. The remaining 60% of proteins were labeled “non-toxic”.
Figure 1

Effects on growth and GFP fluorescence of proteins on overexpression.

Protein overexpression was induced by adding IPTG to ASKA clones grown in LB medium at 37 °C, and the effects on growth and GFP fluorescence were classified into three categories each (“almost no growth”, “slow growth”, “normal growth” and “high fluorescence”, “fluorescence”, “no fluorescence” respectively) [2]. We defined “toxic” proteins as those classified as both “almost no growth” and “no fluorescence” (marked with blue outline). Overall 40% of clones fall into this category (1589/3956).
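
The labeling rule above can be written down in a few lines. The following is a minimal sketch, not the authors' code: the function names `is_toxic` and `percent` are hypothetical, and only the category names and the counts 1589 and 3956 come from the text.

```python
# Growth and fluorescence categories as named in the text.
GROWTH = ("almost no growth", "slow growth", "normal growth")
FLUORESCENCE = ("high fluorescence", "fluorescence", "no fluorescence")


def is_toxic(growth: str, fluorescence: str) -> bool:
    """Return True for clones labeled "toxic" under the definition in the text.

    Hypothetical helper: a clone is toxic only if it shows both
    "almost no growth" and "no fluorescence".
    """
    if growth not in GROWTH or fluorescence not in FLUORESCENCE:
        raise ValueError(f"unknown category: {growth!r}, {fluorescence!r}")
    return growth == "almost no growth" and fluorescence == "no fluorescence"


def percent(part: int, total: int) -> float:
    """Percentage of `part` in `total`."""
    return 100.0 * part / total


# Counts reported in the text: 1589 toxic clones out of 3956.
toxic_percent = percent(1589, 3956)
print(f"{toxic_percent:.1f}% of clones are toxic")
```

Requiring both conditions excludes the small fraction of “almost no growth” clones that still fluoresce, which the text interprets as showing some growth.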

High toxicity of membrane and periplasmic proteins

Membrane proteins are known to be highly toxic when overexpressed [2]. About 85% of proteins with at least one predicted trans-membrane segment are toxic. This fraction increases further to 89% in proteins with two or more trans-membrane segments (Figure 2). With respect to localization, outer membrane proteins and periplasmic proteins are also very toxic (83% and 72% respectively), even though they rarely have predicted trans-membrane regions, indicating that extreme toxicity is a general property of secretory proteins, not just of proteins with trans-membrane segments. These results are consistent with the hypothesis that saturation of the Sec translocation machinery (the major membrane translocation machinery in E. coli) by overexpressed secretory proteins is responsible for their extreme toxicity [15].
Figure 2

Percentage of toxic proteins as a function of number of trans-membrane segments.

In proteins without any trans-membrane segment, about 25% are toxic. This percentage increases to ∼73% in proteins with one trans-membrane segment and ∼85% in proteins with two or more trans-membrane segments. The number of trans-membrane segments was predicted using TMHMM [35].

Considering the high toxicity of secretory proteins and its potentially different mechanism from that of cytoplasmic proteins, we excluded membrane (outer and inner membrane) and periplasmic proteins from all further analyses, which leaves 2444 proteins, 432 of which are toxic.

Sequence features associated with toxicity

To better understand the mechanism of toxicity of cytoplasmic proteins, we examined a number of sequence features for their relationship with toxicity. On average, toxic proteins have a significantly higher number of positively charged amino acid residues (arginine and lysine), are longer, and have more extreme isoelectric points (pI) (Figure 3a and Figure 3b). The number of positively charged residues is the feature most strongly associated with toxicity (Figure 3). The effect of length is evident only for very long proteins (Figure 3b). The significantly higher number of positively charged residues in toxic proteins indicates that electrostatic mis-interactions resulting from protein overexpression are an important cause of toxicity in E. coli.
Figure 3

Sequence features associated with toxicity.

(A) Toxic proteins have on average a higher number of positively charged residues (arginine and lysine), a higher isoelectric point (pI) and greater length than non-toxic proteins. Wilcoxon test p values are 2e-17, 6e-4 and 5e-10 respectively. (B) Proteins are binned into 20 equal-sized bins (thus each bin has 5% of proteins) and the percentage of proteins that are toxic is plotted for each bin as a function of the three sequence features. Linear regression lines are plotted for average positively charged residues and average length, and a quadratic regression is plotted for pI.
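
The positive-residue count and the equal-sized binning used in panel (B) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names are hypothetical, sequences are assumed to be one-letter amino acid strings, and ties between proteins with equal feature values are broken by input order.

```python
from typing import List, Sequence, Tuple


def positive_residue_count(sequence: str) -> int:
    """Count arginine (R) and lysine (K) residues in a one-letter protein sequence."""
    seq = sequence.upper()
    return seq.count("R") + seq.count("K")


def percent_toxic_by_bin(
    values: Sequence[float], toxic: Sequence[bool], n_bins: int = 20
) -> List[Tuple[float, float]]:
    """Sort proteins by a feature, split them into n_bins near-equal bins,
    and return (mean feature value, percent toxic) for each bin."""
    if len(values) != len(toxic):
        raise ValueError("values and toxic must have the same length")
    if len(values) < n_bins:
        raise ValueError("need at least one protein per bin")
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    result = []
    for b in range(n_bins):
        # Integer arithmetic spreads any remainder evenly across the bins.
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        mean_value = sum(values[i] for i in idx) / len(idx)
        pct_toxic = 100.0 * sum(1 for i in idx if toxic[i]) / len(idx)
        result.append((mean_value, pct_toxic))
    return result
```

With the default of 20 bins each bin holds about 5% of the proteins, so the per-bin percentage of toxic proteins can be plotted against the bin's mean feature value.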

Functional classes associated with toxicity

Next we analyzed functional classes significantly associated with toxic proteins. We considered higher-level GO classes in which about 200 or more proteins were present (18 functional classes). The functional classes significantly overrepresented in toxic proteins are “nucleic acid binding transcription factor activity” and “regulation of cellular processes”, while the class significantly underrepresented is “catalytic activity” (Figure 4). Since many regulatory proteins are also transcription factors, we analyzed whether regulatory proteins excluding transcription factors are also enriched in toxic proteins. Excluding transcription factors, “regulation of cellular processes” is still enriched in toxic proteins (Figure 4), suggesting that toxicity is associated with dysregulation of cellular processes in general.
Figure 4

Functional classes associated with toxicity.

Percentage of toxic proteins is much higher in transcription factors (Fisher p = 1e-13) and in “regulation of cellular processes” (Fisher p = 1e-13). “regulation of cellular processes” was enriched in toxic proteins even after excluding transcription factors (Fisher p = 3e-5). Catalytic proteins were significantly depleted in toxic proteins (Fisher p = 3e-4). Dotted line indicates overall average in cytoplasmic proteins.
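
The Fisher p values quoted here come from 2x2 contingency tables (toxic vs. non-toxic, in class vs. not in class). The sketch below shows one common way to compute a two-sided Fisher exact p value using only the Python standard library. It is an illustration, not the authors' implementation: the function name is hypothetical, and the example table is a small made-up one, since the per-class counts are not given in the text.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def weight(x: int) -> int:
        # Exact integer numerator of the probability of a table whose
        # top-left cell equals x (the denominator is comb(n, col1)).
        return comb(row1, x) * comb(row2, col1 - x)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    observed = weight(a)
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / comb(n, col1)


# Made-up illustrative table (NOT data from the paper):
# rows = in class / not in class, columns = toxic / non-toxic.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"p = {p_value:.4f}")
```

Comparing exact integer weights rather than floating-point probabilities avoids rounding problems when deciding which tables are at least as extreme as the observed one.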

Predictive accuracy and independence of sequence and functional features

To assess the predictive power and independence of the sequence and functional features identified, we built a Random Forest model [16]. Using positively charged residue count, pI, length, and transcription factor, regulatory and catalytic function information, the model predicts toxicity with an area under the receiver operating characteristic curve (ROC-AUC) of 0.72 (Figure S1), showing that these few features carry enough information to predict protein toxicity with reasonable accuracy. A random predictor would have a ROC-AUC of 0.5, while a perfect predictor would have a ROC-AUC of 1. Functional classes (transcription factor, regulatory and catalytic function information) alone predict toxicity with a ROC-AUC of 0.58, while sequence features (positively charged residue count, pI, and length) alone predict toxicity with a ROC-AUC of 0.67. The increase in accuracy on combining functional and sequence features (Figure 5) indicates at least partial independence of these features in predicting toxicity.
Figure 5

Independence of sequence and functional properties in predicting toxicity.

Prediction accuracy (ROC-AUC) of overexpression toxicity from sequence (positively charged residue count, pI, and length) and functional features (transcription factor, regulation and catalytic function information). Combining sequence and functional features increases the predictive power indicating at least their partial independence. TF = transcription factors, pI = Isoelectric point.
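
For readers unfamiliar with the accuracy measure, ROC-AUC equals the probability that a randomly chosen toxic protein receives a higher predicted score than a randomly chosen non-toxic one. The sketch below computes it directly from that definition. It is only an illustration of the metric: the function name is hypothetical, the scores are made up, and the Random Forest model itself is not reproduced here.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive example has the higher score; tied pairs count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores for four proteins (NOT data from the paper).
example_auc = roc_auc([0.1, 0.4, 0.35, 0.8], [False, False, True, True])
print(example_auc)
```

A predictor that assigns the same score to every protein yields 0.5 under this definition, and one that ranks every toxic protein above every non-toxic protein yields 1, matching the reference values given in the text.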

Discussion

Here we analyze a number of sequence and functional properties associated with proteins that show overexpression toxicity in E. coli to understand its mechanism. While membrane proteins are known to be highly toxic when overexpressed [2], we find that periplasmic proteins, which generally do not have trans-membrane segments, also show very high toxicity. The Sec pathway is the major route of protein translocation across, and insertion into, the inner membrane of E. coli. The fact that most secretory proteins show very high toxicity is consistent with the hypothesis that saturation of the Sec translocation machinery by overexpression of secretory proteins is responsible for their extreme toxicity [15]. Considering the high and potentially different mode of toxicity of secretory proteins, we focused on the mechanism of toxicity of cytoplasmic proteins. While a number of studies have analyzed sequence features associated with overexpression of soluble proteins in E. coli and bacterial cell-free systems [17]–[27], none has examined the sequence and functional features that determine how overexpression of endogenous proteins affects the growth of E. coli. We find that the number of positively charged residues is the most predictive feature of overexpression toxicity of cytoplasmic proteins (Figure 3). Toxic proteins have a significantly higher isoelectric point overall (Figure 3a), though proteins with a very low isoelectric point are also more toxic (Figure 3b). These results indicate that electrostatic mis-interactions induced by increased concentration mediate toxicity of cytoplasmic proteins in E. coli. Toxic proteins were also significantly longer; in particular, very large proteins (top 5% in length, Figure 3b) were highly toxic. The larger surface area of longer proteins may allow more mis-interactions. Misfolding and self-aggregation (inclusion bodies) are commonly observed during protein overexpression in E. coli and may be toxic [28].
However, higher charge on proteins is often associated with increased solubility and lower self-aggregation [17], [19]–[21], [23], [25]–[27], [29], suggesting that misfolding and self-aggregation are not the major mechanism of overexpression toxicity. Indeed, in vitro protein solubility information [25] did not increase the prediction accuracy of the random forest model trained on length, pI and number of positively charged residues. Furthermore, toxic proteins do not have higher hydrophobicity, a property often associated with self-aggregation, than non-toxic proteins (mean hydrophobicity 0.472 vs. 0.475 respectively, two-tailed t-test p = 0.03). Chaperone (GroEL) substrates [30], [31] are also not enriched in toxic proteins (Fisher p = 0.5). It is tempting to speculate that the high toxicity of positively charged proteins is due to their interactions with negatively charged DNA, which may cause transcription dysregulation (also see below) and prevent the expression of essential proteins. The importance of charged residues in protein sequence for toxicity suggests that reducing the number of charged residues (particularly positively charged residues) may reduce overexpression toxicity. This could be done by removing charged stretches from the protein or by site-directed mutagenesis. Reducing the length of the protein where it is very long (e.g. by cloning different domains separately) may also help decrease overexpression toxicity. While we have used simple measures of protein charge, more sophisticated features that take into account the distribution of charged residues along the sequence and structure of the protein may allow better prediction of toxicity and may also be useful in the design of antimicrobial peptides, whose activity is attributed to their charge [32]. Among the functional classes, transcription factors are highly toxic on overexpression (Figure 4).
Transcription factors have only marginally more positively charged residues than non-transcription factors (median 30 vs. 27 respectively, Wilcoxon p = 0.03) and do not differ in length (median 264 vs. 265 respectively, Wilcoxon p = 0.8); thus this effect does not depend on these features. We hypothesize that overexpression of transcription factors may allow them to bind to non-native DNA sites, which may saturate the transcription machinery and prevent transcription of proteins important for cell survival. Regulatory proteins excluding transcription factors were also enriched in toxic proteins, though less so than transcription factors (Figure 4). Overexpression of regulatory proteins may also eventually cause transcription dysregulation leading to growth inhibition. Catalytic proteins are an interesting class because they show significantly less toxicity even though they have significantly more positively charged residues than non-catalytic proteins (median 33 vs. 21 respectively, Wilcoxon p = 3e-74) and are longer (median 327 vs. 180 respectively, Wilcoxon p = 5e-113). As expected, within catalytic proteins, toxic proteins have significantly more positively charged residues than non-toxic proteins (median 48 vs. 31 respectively, Wilcoxon p = 3e-20) and are longer (median 430 vs. 315 respectively, Wilcoxon p = 1e-17). At present it is unclear why catalytic proteins are less sensitive to overexpression toxicity than non-catalytic proteins. The dosage balance hypothesis posits that an imbalance in the relative amounts of the proteins in a protein complex (over- or underexpression) would disrupt its functionality [33]. Thus complexes should be enriched in toxic proteins. While we find that “macromolecular complexes” are enriched in toxic proteins (28% toxic proteins in “macromolecular complexes” vs. 17% in the rest, Fisher p = 1e-4), these proteins also have significantly more positively charged residues (Wilcoxon p = 8e-5).
Further, adding protein complex information did not increase the predictive power of the random forest model. These observations suggest that the enhanced toxicity of proteins in complexes is also due to electrostatic mis-interactions rather than dosage imbalance. How does the mechanism of overexpression toxicity compare between yeast and E. coli? In yeast, proteins showing overexpression toxicity are highly enriched in structural disorder, which is widely associated with protein-protein and protein-DNA interactions in eukaryotes [11]–[13], so their increased levels may lead to a large number of promiscuous interactions [10] and toxicity. E. coli, like most bacteria, has few disordered regions and proteins, so a priori the mechanism of overexpression toxicity might be expected to be very different in E. coli and yeast. However, we find that in E. coli, sequence features associated with promiscuous electrostatic interactions are significantly associated with overexpression toxicity. These results show that the basic mechanism of overexpression toxicity by mis-interactions is common to yeast and E. coli (and hence elephants [34]), suggesting that this may be a universal phenomenon.

Materials and Methods

The development of the ASKA library is described by Kitagawa et al. [2]. Data on overexpression toxicity of proteins were downloaded from http://ecoli.naist.jp/GB8-dev/index.jsp?page=resource_download.jsp. Trans-membrane segments were predicted using TMHMM [35]. Gene ontology class and localization information (“membrane”, “outer membrane” and “periplasmic space”) were obtained from the ECOCYC database [36]. For functional analyses we considered all GO function and process classes with about 200 or more proteins. There were 18 such classes. Protein hydrophobicity was calculated with the Kyte and Doolittle hydrophobicity scale normalized from 0 to 1. We used Random Forest to test the predictive power and independence of sequence and functional features. Random Forest is a statistical learning algorithm that uses an ensemble of decision trees [16], [37]. In random forests, prediction error is estimated internally without the need for explicit cross-validation: each decision tree is constructed using a different bootstrap sample of the original data, so approximately one-third of the cases are left out of the training sample and not used in the construction of that tree. These left-out cases can therefore be used to estimate prediction error. As the number of toxic proteins was much smaller than the number of non-toxic proteins, we randomly selected an equal number of non-toxic proteins to build the classifier. This was done 10 times and the average area under the receiver operating characteristic curve (ROC-AUC) is reported as the accuracy measure.

Figure S1

ROC curve illustrating the accuracy of toxicity prediction based on sequence and functional features. Considering all sequence (positively charged residue count, pI, and length) and functional features (transcription factor, regulation and catalytic function information), the area under the ROC curve is 0.72.
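
The normalized hydrophobicity measure mentioned in the Methods can be sketched as below. The scale values are the published Kyte and Doolittle hydropathy indices. How the authors normalized them is not stated, so the per-residue min-max scaling to [0, 1] followed by averaging over the sequence is an assumption, as are the function name and the choice to skip non-standard residues.

```python
# Kyte and Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
KD_MIN = min(KYTE_DOOLITTLE.values())  # -4.5 (arginine)
KD_MAX = max(KYTE_DOOLITTLE.values())  # 4.5 (isoleucine)


def mean_hydrophobicity(sequence: str) -> float:
    """Mean Kyte-Doolittle value of a sequence, min-max scaled to [0, 1].

    Assumption: residues outside the 20 standard amino acids are skipped.
    """
    scaled = [
        (KYTE_DOOLITTLE[aa] - KD_MIN) / (KD_MAX - KD_MIN)
        for aa in sequence.upper()
        if aa in KYTE_DOOLITTLE
    ]
    if not scaled:
        raise ValueError("sequence contains no standard amino acids")
    return sum(scaled) / len(scaled)
```

Under this scaling a sequence made only of arginine scores 0 and one made only of isoleucine scores 1, so typical proteins fall between the two.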

References (36 in total)

1.  Analysis of high throughput protein expression in Escherichia coli.

Authors:  Yair Benita; Michael J Wise; Martin C Lok; Ian Humphery-Smith; Ronald S Oosting
Journal:  Mol Cell Proteomics       Date:  2006-07-04       Impact factor: 5.911

2.  Protein solubility: sequence based prediction and experimental verification.

Authors:  Pawel Smialowski; Antonio J Martin-Galiano; Aleksandra Mikolajka; Tobias Girschick; Tad A Holak; Dmitrij Frishman
Journal:  Bioinformatics       Date:  2006-12-06       Impact factor: 6.937

3.  Bimodal protein solubility distribution revealed by an aggregation analysis of the entire ensemble of Escherichia coli proteins.

Authors:  Tatsuya Niwa; Bei-Wen Ying; Katsuyo Saito; WenZhen Jin; Shoji Takada; Takuya Ueda; Hideki Taguchi
Journal:  Proc Natl Acad Sci U S A       Date:  2009-02-27       Impact factor: 11.205

4.  Intrinsic protein disorder and interaction promiscuity are widely associated with dosage sensitivity.

Authors:  Tanya Vavouri; Jennifer I Semple; Rosa Garcia-Verdugo; Ben Lehner
Journal:  Cell       Date:  2009-07-10       Impact factor: 41.582

Review 5.  Biodegradation of aromatic compounds by Escherichia coli.

Authors:  E Díaz; A Ferrández; M A Prieto; J L García
Journal:  Microbiol Mol Biol Rev       Date:  2001-12       Impact factor: 11.056

6.  A support vector machine-based method for predicting the propensity of a protein to be soluble or to form inclusion body on overexpression in Escherichia coli.

Authors:  Susan Idicula-Thomas; Abhijit J Kulkarni; Bhaskar D Kulkarni; Valadi K Jayaraman; Petety V Balaji
Journal:  Bioinformatics       Date:  2005-12-06       Impact factor: 6.937

7.  Artificial gene amplification reveals an abundance of promiscuous resistance determinants in Escherichia coli.

Authors:  Valerie W C Soo; Paulina Hanson-Manful; Wayne M Patrick
Journal:  Proc Natl Acad Sci U S A       Date:  2010-12-20       Impact factor: 11.205

8.  Complete set of ORF clones of Escherichia coli ASKA library (a complete set of E. coli K-12 ORF archive): unique resources for biological research.

Authors:  Masanari Kitagawa; Takeshi Ara; Mohammad Arifuzzaman; Tomoko Ioka-Nakamichi; Eiji Inamoto; Hiromi Toyonaga; Hirotada Mori
Journal:  DNA Res       Date:  2006-01-09       Impact factor: 4.458

9.  Proteome-wide analysis of chaperonin-dependent protein folding in Escherichia coli.

Authors:  Michael J Kerner; Dean J Naylor; Yasushi Ishihama; Tobias Maier; Hung-Chun Chang; Anna P Stines; Costa Georgopoulos; Dmitrij Frishman; Manajit Hayer-Hartl; Matthias Mann; F Ulrich Hartl
Journal:  Cell       Date:  2005-07-29       Impact factor: 41.582

10.  High-throughput expression of C. elegans proteins.

Authors:  Chi-Hao Luan; Shihong Qiu; James B Finley; Mike Carson; Rita J Gray; Wenying Huang; David Johnson; Jun Tsao; Jérôme Reboul; Philippe Vaglio; David E Hill; Marc Vidal; Lawrence J Delucas; Ming Luo
Journal:  Genome Res       Date:  2004-10       Impact factor: 9.043
