Evaluating the fitness cost of protein expression in Saccharomyces cerevisiae.

Abstract

Protein metabolism is one of the most costly processes in the cell and is therefore expected to be under the effective control of natural selection. We stimulated yeast strains to overexpress each single gene product to approximately 1% of the total protein content. Consistent with previous reports, we found that excessive expression of proteins containing disordered or membrane-protruding regions resulted in an especially high fitness cost. We estimated these costs to be nearly twice as high as for other proteins. There was a ten-fold difference in cost if, instead of entire proteins, only the disordered or membrane-embedded regions were compared with other segments. Although the cost of processing bulk protein was measurable, it could not be explained by several tested protein features, including those linked to translational efficiency or intensity of physical interactions after maturation. It most likely included a number of individually indiscernible effects arising during protein synthesis, maturation, maintenance, (mal)functioning, and disposal. When scaled to the levels normally achieved by proteins in the cell, the fitness cost of dealing with one amino acid in a standard protein appears to be generally very low. Many single amino acid additions or deletions are likely to be neutral even if the effective population size is as large as that of the budding yeast. This should also apply to substitutions. Selection is much more likely to operate if point mutations affect protein structure by, for example, extending or creating stretches that tend to unfold or interact improperly with membranes.

Keywords: budding yeast; disordered proteins; membrane proteins; molecular evolution rate; protein overexpression

Year: 2013 PMID: 24128940 PMCID: PMC3845635 DOI: 10.1093/gbe/evt154

Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416

Introduction

Proteins constitute a major component of the dry mass of a cell. Synthesis of amino acids and subsequent assembly of polypeptides are costly. The two processes are estimated to consume about one-half of the ATP molecules in a growing yeast cell and involve a large fraction of its nucleic acids and ribosomal proteins (Verduyn 1991; Warner 1999). The huge cost of protein synthesis has been recognized as such for decades (Maaloe and Kjeldgaard 1966; Waldron and Lacroute 1975). More recently, it has been shown that newly assembled polypeptides are released into a crowded environment of macromolecules in which their folding is easily derailed (Ellis 2001). They often end up in a form that is not only unproductive but can also be toxic and sometimes resistant to degradation (Stefani and Dobson 2003; Winklhofer et al. 2008). However, while it is certain that the costs and risks associated with the turnover of the total protein load are large, it remains unknown how much individual protein species differ in this respect. In theory, it is possible to calculate the cost of protein synthesis because the substrates and the process are well known. However, the required parameters are many and they have not yet been estimated with sufficient accuracy (von der Haar 2008; Siwiak and Zielenkiewicz 2010). Because the routes of folding and degradation for different polypeptides are still being worked out, the energy or fitness costs associated with such events are presently impossible to assess (Hartl et al. 2011). Thus, it remains a great challenge in current research to provide analytical, experimental, or computational estimates of selective pressures acting on individual proteins. Evidence that different proteins experience different selective forces on traits other than their primary functions can be extracted from the DNA sequence. 
In particular, it is well established that the rate of molecular evolution differs widely between genes and that those expressed the most are the ones that change the least (Sharp 1991; Pal et al. 2001). One explanation could be that the highly expressed genes mutate at a lower rate, a possibility that has gained some support recently (Martincorena et al. 2012). Most researchers, however, believe that more highly expressed genes are under stronger purifying selection. Some of the tentative explanations invoke functional arguments: importance (essentiality) of function, multiplicity of functions, centrality to metabolic networks, number of transcription factors assisting expression, or enrichment for genetic and/or physical interactions (Fraser et al. 2002; Jordan et al. 2003; Bloom and Adami 2004; Wall et al. 2005; Pal et al. 2006; Vitkup et al. 2006; Xia et al. 2009). For each of these factors, however, correlation with the rate of evolution is much lower than that for the level of gene expression (Rocha 2006; Wang and Zhang 2009). Thus, it appears that it is the amount of protein product that matters most. This could mean that selection tends to purge mutations located in highly expressed genes because they lead to a greater waste of resources (Barton et al. 2010; Vieira-Silva et al. 2011). Not only efficient use of materials and energy but also a high rate of translation can be important. This could result in selection for optimal codon usage in the highly expressed genes (Akashi 2001; Plotkin and Kudla 2010). The more protein molecules, the higher the toxic effect after misfolding; therefore, misfolding-resistant sequences should especially be preserved in highly expressed genes, which would constrain their evolution (Drummond et al. 2005; Drummond and Wilke 2008; Yang et al. 2010). In sum, there is no lack of hypotheses for how the amount of synthesized protein could dictate the rate of molecular evolution. 
However, these hypotheses have been conceived through comparative analyses of DNA/protein sequences and have been verified mostly in the same way. In this article, we report the results of a study aimed at testing these hypotheses experimentally, which has so far been addressed by only a few researchers. The postulate of controlled alteration of selected determinants of the protein production cost has proved difficult to implement. For example, changing the actual codon usage to a devised one alters the stability and hence the abundance of the resulting mRNA variants. The effect of mRNA abundance can be more important than the sought effect of mRNA composition (Kudla et al. 2009; Agashe et al. 2013). Even the seemingly straightforward task of demonstrating that overproduction of unnecessary proteins is disadvantageous has proved challenging. There must be costs associated with synthesis of redundant polypeptides, but there are also costs of their presence in the cell and their interactions with cell structures (Stoebel, et al. 2008; Plata, et al. 2010; Eames and Kortemme 2012). Our approach is based on the assumption that universal costs of protein expression do exist and can be at least partly disentangled if the number and diversity of analyzed proteins are sufficiently large. We relied on a genomic collection of yeast strains, each overexpressing a single protein. Two previous studies measured approximately how much protein was overproduced and categorized the growth effects accompanying this overproduction (Gelperin et al. 2005; Sopko et al. 2006). One experiment measured fitness using a quantitative assay but the level of production was not estimated and the average production could not be calculated as the applied protocol of overexpression differed from those used earlier (Yoshikawa et al. 2011). 
We therefore carried out our own assays in which we stimulated genes to moderate protein overproduction, measured overexpressed protein levels quantitatively, and estimated the growth rate with high accuracy. We first examined our data by asking whether the fitness effect of overexpression was heavily dependent on the cellular role of a tested gene. It was not, as we found by reviewing gene annotations. This was encouraging because we could assume that the effect of metabolic deregulation would not obscure the effect of carrying useless or toxic protein molecules. We thus asked which of the several protein properties could be the best predictor of fitness variation. We confirmed previous reports showing that proteins containing transmembrane (Kitagawa et al. 2006; Osterberg et al. 2006) and disordered (Vavouri et al. 2009; Ma et al. 2010) regions are especially costly to fitness when overexpressed. Crucially, we compared quantitatively these costs with the cost of expressing normal (well-structured cytosolic) proteins. We found that the cost of expressing well-structured cytosolic proteins is very low when scaled to one amino acid addition (and thus also substitution).

Materials and Methods

Strains

We used a previously constructed collection of single yeast open reading frames (ORFs), each with the same inducible promoter P followed by the same tandem affinity tag (His6, HA epitope, protease 3C site, ZZ domain, 19 kDa) cloned into a multicopy plasmid (Gelperin et al. 2005). Plasmids were hosted by the haploid yeast strain Y258. Most of the cloned genes had been tested for errors; only approximately 3% of them were likely to have an undetected mutation (Gelperin et al. 2005).

Fitness Assays

The overexpression strains were inoculated directly from plates shipped by the distributor (Open Biosystems) into 200 μl of synthetic complete (SC) medium with glucose but lacking uracil to stabilize the plasmid. To stimulate overexpression, we used SC with raffinose as a source of carbon and galactose as an inducer, according to a protocol described in the original study that led to moderate overexpression. We then transferred 10-μl aliquots of each culture into 190 μl of fresh glucose medium and incubated for 48 h. From these cultures, 10-μl aliquots were transferred to 135 μl of SC with raffinose for another 48 h. The raffinose cultures were diluted tenfold and the optical densities (ODs) measured. These cell suspensions were diluted again at 1:50 in SC with raffinose and galactose (2% each). In this growth/induction medium, the cultures were allowed to grow for 20 h, at which point their ODs were determined. The ratio of the two OD measurements, which were corrected for the dilution factor, served to calculate the number of cell doublings for each culture. All growth assays were carried out at 30 °C.
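
The doubling calculation described above can be sketched as follows. This is a minimal illustration and not the authors' script: the function name, the argument names, and the OD readings are hypothetical, and only the 1:50 dilution into induction medium is taken from the protocol.

```python
import math

def cell_doublings(od_before, od_after, dilution_into_medium=50.0):
    """Number of doublings between two optical density (OD) readings.

    od_before: OD of the suspension measured just before inoculation
    od_after: OD of the culture at the end of growth, on the same scale
    dilution_into_medium: fold dilution applied between the first reading
        and the start of growth (1:50 in the protocol above)
    """
    starting_od = od_before / dilution_into_medium
    return math.log2(od_after / starting_od)

# Hypothetical readings, chosen only for illustration.
print(round(cell_doublings(0.4, 1.6), 2))  # 7.64
```

A 200-fold increase in density corresponds to about 7.6 doublings, which is of the same order as the average reported in the Results.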

Protein Assays

Overproduction of proteins was induced by transferring cells sequentially from glucose to raffinose, and then to raffinose/galactose medium for 8 h. The cells were then centrifuged, washed with ice-cold water, and frozen. To extract proteins, the cells were beaten with glass beads in 100 μl of lysis buffer (50 mM Tris–HCl, pH 7.5, 0.5% sodium dodecyl sulphate, 0.1 mM ethylenediaminetetraacetic acid, protease inhibitors) for 4 h at 4 °C. Cell remnants were then spun down, and the supernatants were collected. Total protein content was determined using a bicinchoninic acid (BCA) protein assay. For the competitive ELISA, plates were coated overnight at 4 °C with 0.05 μl of normal rabbit serum (Pierce) diluted in 100 μl of 0.2 M carbonate–bicarbonate buffer, pH 9.4. After washing, plates were blocked with 300 μl of 2% bovine serum albumin (BSA) for 24 h. The yeast protein extracts were mixed with protein A conjugated to peroxidase (Pierce), and then 100 μl of the resulting mixture was added to the blocked plate wells, for a total of 10 μg of yeast protein and 25 ng (∼26 μU) of protein A per well. After 1 h of incubation, the mixtures were discarded and the wells washed and filled with 100 μl of the 3,3′,5,5′-tetramethylbenzidine (TMB) substrate. The reaction was terminated after 30 min with 100 μl of 2 M H2SO4, and the absorbance at 450 nm was measured. All washing steps were performed with 200 μl of phosphate-buffered saline containing 0.05% Tween 20. One of the tagged proteins (Ade2p) was purified, diluted into a gradient of known concentrations, and used as a standard to calibrate the reads.

Gene Ontology and Protein Properties

To analyze the GO categories (Saccharomyces Genome Database [SGD]), we applied an ANOVA model in which each of the 5,084 overexpressed genes was described by the Yeast Slim categories taking values of zero or one (absent or present). We used the “lm” function of the R package, followed by the “step” function (based on Akaike Information Criterion [AIC]) to reduce the number of predictor variables by eliminating the nonsignificant ones (R Development Core Team 2010). The analyses were performed separately for the molecular function, cellular component, and biological process classifications. As these classifications contained tens of terms, we did not analyze interactions between them because the latter were very numerous and usually contained too few data points to be meaningful. Protein properties were analyzed by implementing a multiple regression model using the “lm” function. Continuous predictor variables were log-transformed (except for gravy score and mRNA 5′ folding energy); a small constant was added to those with zero values before transformation (Wall et al. 2005). The continuous predictor variables included: mRNA abundance (Garcia-Martinez et al. 2004), protein half-life (Belle et al. 2006), intrinsic disorder/protein length + 0.01 (Linding et al. 2003), protein length (SGD), CAI + 0.1 (SGD), gravy score (SGD), and protein abundance, that is, the number of molecules per protein species (Ghaemmaghami et al. 2003). To calculate the energy of structures at the 5′-end of mRNAs, we used the Vienna RNA Package 2.0 (Lorenz et al. 2011) for stretches extending from the −4 to +37 nucleotide positions (Plotkin and Kudla 2010). All continuous predictor variables were standardized prior to analysis. There were also two categorical variables: physical interaction status (not hub, intermediate number of interactions, party hub, and date hub) (Han et al. 2004; Ekman et al. 2006) and the presence of transmembrane segments (not predicted, predicted by only one study, and predicted by two studies) (Persson and Argos 1994; Krogh et al. 2001). ORFs with missing values in any of the predictor variables were excluded from this analysis. There were 2,913 ORFs with a complete set of predictors, and only those were included in the final orthogonal model. We included all ten listed variables in the model and the first order interactions between them (except for interactions between the two categorical variables). The entire procedure was repeated 40 times with random permutations of the order of categories in the model. The P values for predictor variables were averaged over repeats (geometrically).
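
The preprocessing and averaging steps described above can be sketched in a few lines. This is an illustrative sketch only: the analysis itself was run with the R "lm" function, whereas the helper names and input values below are hypothetical. The base of the logarithm is not stated in the text; the natural logarithm is assumed here, which does not affect standardized predictors.

```python
import math
import statistics

def log_transform(values, constant=0.0):
    """Log-transform a predictor after adding a small constant.

    The constant is only needed for predictors that contain zeros,
    for example 0.01 for the disorder fraction or 0.1 for CAI.
    """
    return [math.log(v + constant) for v in values]

def standardize(values):
    """Center a predictor to mean 0 and scale it to unit sample SD."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def geometric_mean_p(p_values):
    """Geometric mean of the P values obtained in repeated model fits."""
    logs = [math.log(p) for p in p_values]
    return math.exp(statistics.fmean(logs))

# Hypothetical disorder fractions, including a zero.
disorder = standardize(log_transform([0.0, 0.05, 0.11, 0.30], constant=0.01))

# Hypothetical P values of one predictor over three permuted fits.
print(geometric_mean_p([0.01, 0.001, 0.0001]))
```

Averaging on the log scale keeps a single very small or very large P value from dominating the summary, which is presumably why a geometric rather than arithmetic mean was chosen.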

Results

Fitness Effects of Moderate Overexpression of Genes Are Small

We found that an overproduced protein species constituted typically approximately 1% of the total protein amount (more detailed data reported later), which is much less than doses known to be severely toxic (Dong et al. 1995; Geiler-Samerotte et al. 2011). We measured fitness by estimating how many cell divisions occurred in single-strain liquid cultures over a period of about 1 day (see Materials and Methods). This included both lag and growth phases resulting in an average number of doublings of 7.75 (median 7.83) with a standard deviation of 0.45. (The cultures reached about one-fourth of their final density.) Thus, variation in fitness was not high, especially given that a sizable portion of it came from differences between plates and was eliminated from all subsequent analyses by within-plate normalization (see Materials and Methods). Previous studies evaluated the growth of colonies on common agar plates (Gelperin et al. 2005; Sopko et al. 2006) or in individual liquid cultures over a shorter time interval (Yoshikawa et al. 2011; Makanae et al. 2013). Those earlier estimates generally agree with ours (supplementary fig. S1, Supplementary Material online). We sought to assay fitness in a way that would increase the role of fast growth, and thus fast protein processing, in the final measure of fitness. Importantly, we wanted to compare quantitative fitness estimates with quantitative estimates of protein overproduction for a large number of individual clones, which had not been performed in previous studies. Figure 1 shows the distribution of normalized fitness estimates for 5,182 strains containing a unique cloned ORF known to express a protein (SGD). The intraclass correlation coefficient (ICC) calculated over four independent repeats was 0.966, indicating that repeatability of our fitness measurements was high. 
Good repeatability within a strain and large differences between strains (the shape of clouds) suggest that factors other than measurement errors were responsible for much of the fitness variation. Some factors, such as the average copy number of individual plasmids, could not be controlled in this experimental system. All individual records, both normalized and nonnormalized, are listed in supplementary table S1, Supplementary Material online.

The effects of single gene overexpression on growth. The number of cell divisions in single-strain cultures was estimated four times independently. The estimates were divided by the median values of relevant replications to obtain normalized values. (a) The repeatability of the individual normalized fitness estimates and (b) the frequency distribution of strains’ means. The vertical dashed line marks the slowest growing 91 strains. These were removed from all of the following statistical analyses to make the distribution symmetric and closer to normal. (This exclusion was unlikely to affect our analyses. For example, we correlated fitness with ten properties of proteins for all data and those lacking the 77 data points. For data analyzed in this way, pairs of Pearson’s coefficients were themselves very much correlated: Pearson’s r = 0.988, Spearman’s r = 1).

Functional Categorization Explains Little of the Gene Overexpression Effects

As reported later in detail, the median content of overexpressed proteins was approximately 400 times higher than the median content of normally expressed ones (Ghaemmaghami et al. 2003). This could potentially disturb at least some cellular functions. The overexpressed genes fell into 22 Yeast Slim GO cell component categories, 41 molecular function categories, and 100 biological process categories (we decided to reduce the biological process categories to 40 by combining some of the most similar ones). Within each of these three classifications, we first applied a linear model including all categories and then progressively simplified it by eliminating statistically nonsignificant categories (see Materials and Methods). We obtained a relatively low number of potentially important predictors shown in figure 2. There were a few categories associated with increased fitness. These suggest that speeding up turnover of nucleotides and adjusting oxidative metabolism could have a positive effect on fitness. Negative effects were more numerous and larger. They were linked to cell wall and membrane structures. Although these factors were significant on a statistical level, they had very small average effects, approximately 0.005, which is clearly less than the standard deviation of the overall distribution of normalized fitness estimates, 0.032 (fig. 1b). The observed weak dependence of fitness effects on the functions of the overexpressed proteins may be specific to our experimental system. Other arrangements, for example, Escherichia coli and high overexpression, have shown that unnaturally high levels of transcription factors and regulatory proteins can be toxic (Singh and Dash 2013).

Gene Ontology categories as predictors of the overexpression cost. The graph shows the highest and most statistically significant deviations of the Yeast Slim category means from the grand mean (not fitness gains or losses when compared with a strain with no overexpression).

To further test whether growth was indeed relatively insensitive to metabolic deregulation, we focused our analyses on enzymes alone. We revisited a study in which the molecular evolution of enzymes was considered dependent on their metabolic centrality and connectivity (Vitkup et al. 2006). Connectivity of an enzyme had been calculated as the number of other metabolic enzymes that produce or consume the enzyme’s products or reactants. In our data set, 329 of the 350 enzymes examined in the original study were included. We used the same categorization of metabolic connectivity but did not find it helpful in explaining the observed variation in the fitness response to gene overexpression (r = −0.029, P = 0.6). Apparently, the cell’s metabolic network is well buffered against perturbations in the expression level of participating enzymes, at least when single enzymes are overabundant. As reported earlier, most cellular structures and processes were also remarkably resistant to such alterations. We therefore decided that it would be acceptable to execute the analysis of protein properties for all genes together, ignoring their cellular roles and making the statistics both simpler and more powerful.

Only a Few Protein Properties Correlate with the Cost of Overexpression

A review of theoretical and empirical studies disclosed ten properties of proteins/mRNAs that were frequently examined as factors potentially affecting the rate of evolution. The dependence of fitness on the most significant factors is shown in figure 3a. The remaining factors are presented in supplementary figure S2, Supplementary Material online. These graphs illustrate how the fitness of the overexpression strains correlates with each characteristic separately. They show that although the effects of some factors (e.g., protein length) are small, they can be remarkably regular. In a formal statistical analysis, we used a linear model, which examined jointly all single factors and selected interactions (see Materials and Methods). The results are reported more thoroughly in supplementary table S2, Supplementary Material online. Here, in figure 3b, we present only summaries of statistics for individual factors. Some factors, such as protein half-life, codon adaptation index, frequency of physical interactions, abundance under normal expression, energy of 5′ mRNA fold, and gravy score proved nonsignificant. Two of the statistically significant factors, the presence of transmembrane regions and the proportion of protein length occupied by sequences predicted to be loosely shaped (intrinsically disordered), refer to properties that become meaningful only after a protein chain is synthesized and folded. Other properties may be important at the time of synthesis. There was a negative correlation between the level of mRNA under normal expression and fitness. This could mean that overexpression of the normally common transcripts tends to deplete optimal tRNAs for production of redundant proteins and thus slow down elongation of those needed. However, the effect of high CAI on fitness, although negative, was not statistically significant. 
The energy of the folding of 5′ mRNAs was also neutral, suggesting that transcripts with rigid spatial structures did not trap too many ribosomes (Plotkin and Kudla 2010). It thus appears that there is no shortage of ribosomes, and possibly optimal tRNAs, when 1% of translation is useless, at least under the growth conditions applied here. Finally, there was a negative correlation between protein length and fitness indicating that the amount of an overproduced protein mattered (because all overexpressed proteins had the same promoter). This relation attracted our attention especially because it appeared to be very regular over the entire range of protein lengths (fig. 3a). We therefore decided to test experimentally whether the length of a protein is a good proxy for its amount under overexpression.

Protein properties and the fitness cost of overexpression. (a) Examples of fitness predictors (only the most significant predictors are shown; the remaining ones are in supplementary fig. S2, Supplementary Material online). Moving averages are shown as red lines for continuous variables. (b) Results of multifactorial analysis. Statistical significance of positive (green) and negative (red) effects is shown.

Relating Fitness Cost to the Amount of Protein

We estimated the cellular level of overproduced protein for a large sample of strains. Repeatability of estimates obtained by competitive ELISA was high (ICC = 0.944, n = 719, P ≪ 0.001) and centered on a median of 0.63% (fig. 4a). The relationship between the amount of overproduced protein and its length is shown in figure 4b; Pearson’s correlation coefficient was significant (r = 0.136, df = 717, P = 0.0002). To find a quantitative relation between the length of a protein and its amount under overexpression, we used a data set without the outliers seen in figure 4b (see supplementary methods, Supplementary Material online for details). We found that when the length of a protein doubles, its amount under overexpression increases by about one-half (the slope of a linear regression with both axes log-transformed was 0.47). We could then assign to every protein its expected amount under overexpression as a function of its length. From the common model of multiple regression, we found the relationships between the length of a protein (and its amount), the presence of transmembrane regions, and the presence of disordered regions, the three factors jointly affecting fitness (supplementary table S3, Supplementary Material online). This information is summarized in table 1, which lists the cost of expressing different proteins per 1% of total protein mass and per amino acid. To get the latter estimates, we assumed that the total mass of proteins in the yeast cell is 6.0 × 10⁻¹² g (Sherman 2002). Knowing the number of molecules (Ghaemmaghami et al. 2003) and their molecular weights, we could calculate the total weight of every protein. The contribution of special regions was calculated from the proportions of the transmembrane or disordered regions calculated for every individual protein species (Persson and Argos 1994; Krogh et al. 2001; Linding et al. 2003). 
One implicit assumption that could introduce only a minimal bias to our estimates is the assumption that the per amino acid weight of the transmembrane, disordered, and other regions was equal (see supplementary methods [Supplementary Material online] for more details regarding calculations).
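
The log-log regression slope reported above implies a power-law relation between protein length and overexpressed amount. The sketch below only restates that arithmetic; the function name is hypothetical and the slope of 0.47 is the one value taken from the text.

```python
def fold_change_in_amount(length_ratio, slope=0.47):
    """Expected fold change in overexpressed amount for a given fold
    change in protein length, assuming amount is proportional to
    length raised to the power of the log-log regression slope."""
    return length_ratio ** slope

# Doubling the length gives roughly a 1.4-fold increase in amount.
print(round(fold_change_in_amount(2.0), 2))  # 1.39
```

Because the exponent is below 1, the amount grows more slowly than the length, so longer proteins are present as fewer molecules but still add more total mass.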

Table 1

Fitness Cost of Protein Expression

Protein Type^a	1% of Total Protein^b (Mean ± SE)	Special Region Fraction (Mean ± SD)	Cost Per Single aa^c (Mean ± SE)
Standard	0.023 ± 0.005	—	(7.32 ± 1.63) × 10⁻¹¹
Disordered (added)	0.017 ± 0.004	0.11 ± 0.08	(6.76 ± 1.47) × 10⁻¹⁰
Trans-membrane (added)	0.012 ± 0.002	0.13 ± 0.10	(4.78 ± 0.82) × 10⁻¹⁰

^a Proteins were standard (that is, cytosolic and well structured), contained disordered regions, or were located in membranes. The proportion of protein length taken by the disordered or transmembrane regions is shown in the middle column.

^b The fitness cost of producing 1% of superfluous polypeptide (standard), plus the costs added by the presence of disordered or transmembrane regions.

^c The fitness cost of expressing one amino acid in one protein molecule if the amino acid is located in standard or special regions.
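
The order of magnitude of the per-amino-acid cost for standard proteins in table 1 can be checked with a back-of-the-envelope calculation. This is a rough sketch under a stated assumption: a mean residue mass of 110 Da is used here for illustration, whereas the authors worked from the molecular weights of individual proteins, so the result is expected to be close to the tabulated value but not identical to it.

```python
AVOGADRO = 6.02214076e23          # molecules per mole
TOTAL_PROTEIN_G = 6.0e-12         # protein mass per yeast cell (from the text)
MEAN_RESIDUE_DA = 110.0           # assumed average residue mass, illustrative

def cost_per_residue(cost_per_percent, percent=1.0):
    """Fitness cost of one amino acid in one protein molecule, given the
    fitness cost of a block of superfluous protein expressed as a
    percentage of total protein mass."""
    block_mass_g = TOTAL_PROTEIN_G * percent / 100.0
    residue_mass_g = MEAN_RESIDUE_DA / AVOGADRO
    residues_in_block = block_mass_g / residue_mass_g
    return cost_per_percent / residues_in_block

# Standard proteins: 0.023 per 1% of total protein (table 1).
print(f"{cost_per_residue(0.023):.2e}")
```

Under these assumptions, 1% of total protein corresponds to a few hundred million residues, which places the per-residue cost near 7 × 10⁻¹¹, in line with the first row of the table.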

The level of protein overexpression. (a) Frequency distribution of the amount of protein at the normal (empty bars) and overexpressed (filled bars) levels. Normal protein levels were taken from a previous study (Ghaemmaghami et al. 2003) and overexpression estimates were obtained in this study using a competitive ELISA assay. (b) The relationship between protein length and protein overexpression level (see supplementary methods, Supplementary Material online).

Table 1 shows that the average effect of having a disordered region or a transmembrane domain is remarkable but not excessively large. On average, disordered regions nearly doubled the fitness cost of the entire protein. Similarly, the membrane proteins were substantially more costly than were the cytosolic ones. The costs expressed per amino acid show the relative fitness changes of expanding some regions at the expense of other regions. They may also serve to compare fitness costs of proteins expressed at different levels. The yeast proteins are represented by very different numbers of molecules per cell under natural expression, from 10 to 1 million (Ghaemmaghami et al. 2003). In the analyses described earlier, either some of the characteristics borrowed from other studies or our own measurements were lacking for a number of genes. 
We asked which of our results would hold if a single analysis were performed for those genes only for which both the fitness estimate, as well as the protein overexpression level, and all other variables were known. There were only 423 such genes. Detailed results are presented in supplementary table S4, Supplementary Material online. Briefly, the presence of transmembrane domains remained the most significant factor. Three factors pertaining to protein abundance—the measured level, the reported half-life, and the predicted length—were also significant or nearly significant. This latest finding is yet another indication that it is not only the structural properties of a redundant protein but also its amount that contributes to toxicity.

Discussion

We found that overexpression of single genes in Saccharomyces cerevisiae generally leads to moderate but variable effects on growth. This variation is partly explained by the properties of the overexpressed protein molecules and the roles they play in cellular metabolism. Cell growth also correlated with the amount of overexpressed protein, indicating that synthesis and processing of useless polypeptides lowers the efficiency of cell growth. This particular cost was relatively small, which explains why it has not been convincingly demonstrated in former studies. Proteins with disordered or intramembrane regions were especially damaging to fitness when overexpressed. Based on these findings, we propose that an addition, or exchange, of a single amino acid is of little consequence for fitness unless it extends or creates protein regions forming critical structures. There are two possible explanations why the disordered and transmembrane regions are especially damaging to fitness when overexpressed. One of them concentrates on overload, the other on toxicity. Considering overload, we note that the summed mass of all membrane proteins is 15% of the total protein content in a yeast cell. Similarly, the disordered stretches of polypeptides make up approximately 12% of total protein. Therefore, the same weight of an extra 1% of protein constitutes a considerably higher overload in terms of proportion added to the proteins that are in membranes or are disordered. The costs associated with transmembrane proteins can include membrane piercing, interfering with other membrane proteins, or engaging membrane-specific folding pathways. Similarly, if maintaining the total pool of loosely structured proteins poses some special cost to the cell, then every overexpressed member of this group adds a higher proportion to this cost. 
Generally, the costs of overload could result from expressing those proteins that are more expensive or risky to keep in the cell even if they function as expected. A type of overload hypothesis has been proposed in which malfunctioning of membranes occurs in response to the overexpression of a membrane protein (Eames and Kortemme 2012). By contrast, the cost of toxicity means that overexpressed protein chains acquire new and unwanted functions. It is possible that both the disordered and membrane proteins are especially likely to undergo such transformation. The disordered or unstructured regions have important functions in signaling, control, and regulation (Dunker et al. 2008). Proteins with such regions interact with one another and with unrelated proteins, which leads to misfolding and aggregation (Uversky et al. 2008; Vavouri et al. 2009; Olzscha et al. 2011). Aggregates tend to expose hydrophobic surfaces and therefore tend to illegitimately penetrate and damage cellular membranes (Kourie and Henry 2002; Stefani 2008). Even the programmed formation of transmembrane domains can be sensitive to crowding and nonprescribed interactions with other regions of polypeptides (Levine et al. 2005; Mackenzie 2006; Skach 2009; Chakrabarti et al. 2011). In sum, there are good hypothetical explanations why transmembrane and disordered proteins are especially likely to be overloaded or driven into toxicity when overexpressed. However, substantial efforts would be needed to find which of the two possible mechanisms is actually occurring when a particular protein is overexpressed.

There are two other properties of proteins that correlated with the cost of overexpression: the length of the polypeptide and the abundance of the cognate mRNA under normal expression. As explained in the Results, we believe the two traits are simply correlated with the amount of useless protein and that this unnecessary burden is the real cause of fitness decrease.
We base our assumption on the remarkable regularity of the relationship between polypeptide length and fitness loss, as well as on a statistically significant relation between polypeptide length and the actual abundance of overexpressed protein in the cell. We considered two alternative hypotheses. One assumes that long proteins are disproportionately more likely to misfold and thus overexploit molecular chaperones. To test this, we asked whether the overexpression of proteins known to interact with molecular chaperones had more substantial effects on fitness. We do not report these tests because we did not find any relationship between the fitness cost and the frequency of interactions with single chaperones (Bogumil et al. 2012), sets of chaperones revealed in large-scale studies (Gong et al. 2009), or smaller but carefully confirmed chaperone assemblages (Hartl et al. 2011). These results are in accord with a report suggesting that chaperones are efficient enough to handle a load of misfolded proteins that is substantially higher than 1% (Vabulas and Hartl 2005). Another alternative explanation, that long proteins have more domains and thus are more damaging to the cellular regulatory mechanisms, has been tested and rejected (see Results). We therefore propose that our observed negative effect of protein length on fitness reflects the general cost of protein processing, which includes all expenses involved in protein synthesis, maturation, maintenance, and disposal.

Our results can be used to address the question of whether natural selection is strong enough to prevent a single amino acid being added or exchanged for another one. The efficiency with which genomes and proteomes are purged of mutations depends not only on the strength of their effects but also on population size (Lynch and Conery 2003; Fernandez and Lynch 2011). Natural selection operates when $2N_e s > 1$, where $N_e$ stands for effective population size and $s$ for the selection coefficient.
It is effective when this product is ten times higher. The effective population size of a species closely related to S. cerevisiae, S. paradoxus, was estimated at $8.6 \times 10^{6}$ (Tsai et al. 2008). We found that the average cost of processing one amino acid is approximately $7 \times 10^{-11}$ (table 1). This is the cost of adding one unnecessary amino acid to one polypeptide molecule, so it needs to be multiplied by the number of affected molecules. It follows that to be nonneutral ($2N_e s > 1$), a mutation of this type must hit a protein represented by more than 830 molecules per cell. In S. cerevisiae, some three-fourths of proteins meet this weaker criterion but only a small minority meet the stronger one (Ghaemmaghami et al. 2003). Thus, selection can possibly act on a single amino acid only if the effective population size is as large as in yeast and only if proteins are sufficiently abundant. The entire cost of this size would be at stake if an amino acid were to be deleted or inserted. Substitution would most likely still be less costly and thus more often neutral. In many organisms, the effective population size is much smaller, even by three orders of magnitude (Charlesworth 2009; Gossmann et al. 2012), making selection still less effective.

Our empirical findings generally agree with the results of an earlier computational study. Expending single atoms of the main components of yeast biomass (such as carbon or nitrogen) has been found selectively nonneutral for just approximately 1% of proteins (those most abundantly expressed). Only under starvation for rarer elements, such as sulfur, can a wasteful use of one atom (or of the amino acid in which it resides) be significant for a substantial proportion of proteins (Bragg and Wagner 2009).

Considering the factors that could control the evolution of protein sequence, it is remarkable that the fitness costs associated with amino acids residing within the disordered or transmembrane regions were so much higher.
It appears justifiable to speculate that natural selection would operate most intensely on mutations creating new or extending existing regions of danger. Selection would act not only against mutations that make misfolding or misinteraction unavoidable (Yang et al. 2012) but also against any changes in the DNA sequence that could increase the rate of transcriptional and translational errors resulting in alterations of the spatial structure of proteins (Drummond et al. 2005; Drummond and Wilke 2008). Such changes could result in selection coefficients that were higher by several orders of magnitude than those arising from amino acid substitutions in standard protein regions. This is because any unwinding of a polypeptide can involve dozens of amino acids, each being ten times more costly than it was in a safe structure. There is some evidence to suggest that selection preventing structural aberration can be strong (Chiti and Dobson 2006; Geiler-Samerotte et al. 2011), but further work is clearly needed to show that much or perhaps most of the variation in the rate of protein evolution can be attributed to selection minimizing the danger of protein misfolding and toxicity.
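
The two back-of-the-envelope calculations used in this Discussion can be reproduced with a short script. This is only an illustrative sketch, not code from the study: the function and variable names are hypothetical, the numerical inputs are the values quoted in the text, and the per-residue cost is assumed to scale linearly with the number of protein molecules per cell.

```python
# Illustrative sketch only; names are hypothetical, values are those quoted in the text.
N_E = 8.6e6               # effective population size of S. paradoxus (Tsai et al. 2008)
COST_PER_RESIDUE = 7e-11  # fitness cost of one amino acid in one molecule (table 1)


def min_copies_for_selection(n_e, cost_per_residue, threshold=1.0):
    """Copy number per cell above which 2 * Ne * s exceeds the threshold.

    Assumes s = cost_per_residue * copies, i.e. the cost of one extra
    residue grows linearly with the number of affected molecules.
    """
    return threshold / (2.0 * n_e * cost_per_residue)


def relative_overload(extra_fraction, pool_fraction):
    """Relative growth of a protein pool when extra_fraction of total
    protein is added to a pool that makes up pool_fraction of the total."""
    return extra_fraction / pool_fraction


if __name__ == "__main__":
    weak = min_copies_for_selection(N_E, COST_PER_RESIDUE)          # 2Nes > 1
    strong = min_copies_for_selection(N_E, COST_PER_RESIDUE, 10.0)  # 2Nes > 10
    print(f"weaker criterion:   about {weak:.0f} molecules per cell")
    print(f"stronger criterion: about {strong:.0f} molecules per cell")
    # An extra 1% of total protein added to the membrane (15%) or disordered (12%) pool
    print(f"membrane pool grows by   {relative_overload(0.01, 0.15):.1%}")
    print(f"disordered pool grows by {relative_overload(0.01, 0.12):.1%}")
```

The weaker criterion evaluates to roughly 830 molecules per cell, matching the figure given above. The stronger one is simply ten times that, because the threshold enters the formula linearly.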

Supplementary Material

Supplementary methods, tables S1–S4, and figures S1 and S2 are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).

74 in total

1. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.

Authors: A Krogh; B Larsson; G von Heijne; E L Sonnhammer
Journal: J Mol Biol Date: 2001-01-19 Impact factor: 5.469

2. Comprehensive phenotypic analysis of single-gene deletion and overexpression strains of Saccharomyces cerevisiae.

Authors: Katsunori Yoshikawa; Tadamasa Tanaka; Yoshihiro Ida; Chikara Furusawa; Takashi Hirasawa; Hiroshi Shimizu
Journal: Yeast Date: 2011-02-22 Impact factor: 3.239

Review 3. Synonymous but not the same: the causes and consequences of codon bias.

Authors: Joshua B Plotkin; Grzegorz Kudla
Journal: Nat Rev Genet Date: 2010-11-23 Impact factor: 53.242

4. Proteins deleterious on overexpression are associated with high intrinsic disorder, specific interaction domains, and low abundance.

Authors: Liang Ma; Chi Nam Ignatius Pang; Simone S Li; Marc R Wilkins
Journal: J Proteome Res Date: 2010-03-05 Impact factor: 4.466

5. Population genomics of the wild yeast Saccharomyces paradoxus: Quantifying the life cycle.

Authors: Isheng J Tsai; Douda Bensasson; Austin Burt; Vassiliki Koufopanou
Journal: Proc Natl Acad Sci U S A Date: 2008-03-14 Impact factor: 11.205

6. Cellular mechanisms of membrane protein folding.

Authors: William R Skach
Journal: Nat Struct Mol Biol Date: 2009-06 Impact factor: 15.369

Review 7. Intrinsically disordered proteins in human diseases: introducing the D2 concept.

Authors: Vladimir N Uversky; Christopher J Oldfield; A Keith Dunker
Journal: Annu Rev Biophys Date: 2008 Impact factor: 12.981

8. Non-adaptive origins of interactome complexity.

Authors: Ariel Fernández; Michael Lynch
Journal: Nature Date: 2011-05-18 Impact factor: 49.962

9. Chaperones divide yeast proteins into classes of expression level and evolutionary rate.

Authors: David Bogumil; Giddy Landan; Judith Ilhan; Tal Dagan
Journal: Genome Biol Evol Date: 2012-03-14 Impact factor: 3.416

10. Why is the correlation between gene importance and gene evolutionary rate so weak?

Authors: Zhi Wang; Jianzhi Zhang
Journal: PLoS Genet Date: 2009-01-09 Impact factor: 5.917
