Qirun Wang1, Jie Lin2,3. 1. Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China. 2. Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China. linjie@pku.edu.cn. 3. Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China. linjie@pku.edu.cn.
Abstract
While most genes' expression levels are proportional to cell volumes, some genes exhibit nonlinear scaling between their expression levels and cell volume. Therefore, their mRNA and protein concentrations change as the cell volume increases, which often have crucial biological functions such as cell-cycle regulation. However, the biophysical mechanism underlying the nonlinear scaling between gene expression and cell volume is still unclear. In this work, we show that the nonlinear scaling is a direct consequence of the heterogeneous recruitment abilities of promoters to RNA polymerases based on a gene expression model at the whole-cell level. Those genes with weaker (stronger) recruitment abilities than the average ability spontaneously exhibit superlinear (sublinear) scaling with cell volume. Analysis of the promoter sequences and the nonlinear scaling of Saccharomyces cerevisiae's mRNA levels shows that motifs associated with transcription regulation are indeed enriched in genes exhibiting nonlinear scaling, in concert with our model.
While most genes' expression levels are proportional to cell volumes, some genes exhibit nonlinear scaling between their expression levels and cell volume. Therefore, their mRNA and protein concentrations change as the cell volume increases, which often have crucial biological functions such as cell-cycle regulation. However, the biophysical mechanism underlying the nonlinear scaling between gene expression and cell volume is still unclear. In this work, we show that the nonlinear scaling is a direct consequence of the heterogeneous recruitment abilities of promoters to RNA polymerases based on a gene expression model at the whole-cell level. Those genes with weaker (stronger) recruitment abilities than the average ability spontaneously exhibit superlinear (sublinear) scaling with cell volume. Analysis of the promoter sequences and the nonlinear scaling of Saccharomyces cerevisiae's mRNA levels shows that motifs associated with transcription regulation are indeed enriched in genes exhibiting nonlinear scaling, in concert with our model.
Homeostasis of gene expression level in a dynamically growing cell volume is widely observed across the domains of biology, namely, the concentrations of mRNAs and proteins of most genes are approximately constant[1-10]. Since the cell volume usually grows exponentially[9,11,12], this implies exponential growth of mRNA and protein copy numbers as well, and a linear scaling between the gene expression level and cell volume exists. Theoretical models have shown that the linear scaling between gene expression level and exponentially growing cell volume is a consequence of the limiting nature of gene expression machinery: RNA polymerases (RNAPs) and ribosomes[13], in agreement with experimental observations[10,14-16]. However, along with the constant concentrations of mRNAs and proteins of most genes, there is a subset of genes exhibiting nonconstant concentrations as the cell volume increases. These genes are often crucial for cell-cycle and size regulation. They allow cells to sense their sizes based on the concentrations of proteins that have different scaling behaviors with cell volume, such as Whi5 and Cln3 in Saccharomyces cerevisiae[17-21]. Other examples include DNA-binding proteins such as histone proteins, whose expression levels scale with the total DNA amount instead of cell volume[22]. Recent experiments show that proteins with changing concentration are often associated with cell senescence[23]. A fundamental question then arises: if a linear scaling between gene expression level and cell volume are by default for most genes, how can cells achieve nonlinear scaling for a subset of genes with cell volume in the meantime? In this paper, we show that the superlinear and sublinear scaling of gene expression level is a direct consequence of the heterogeneous recruitment abilities of promoters to RNA polymerases. Given a unimodal distribution of recruitment abilities, those genes with their promoters’ recruitment abilities below (above) the average spontaneously exhibit superlinear (sublinear) scaling with cell volume, while genes with recruitment abilities near the average exhibit approximately linear scaling.In the following, we first introduce a gene expression model at the whole-cell level in which the promoters of all genes have the same recruitment abilities to RNAPs, which is analytically solvable. Then we consider a scenario in which all genes except a small subset of genes have the same recruitment abilities and show that the expression levels of those special genes can exhibit superlinear or sublinear scaling with cell volume depending on their relative magnitudes of recruitment abilities compared with the majority of genes. Then we extend the simplified model to allow a continuous distribution of recruitment abilities and show that our simplified model can quantitatively capture this more realistic scenario. Genes with recruitment abilities below (above) the average value naturally exhibit superlinear (sublinear) scaling with cell volume. Finally, to verify our theoretical predictions, we analyze the nonlinear scaling of mRNA numbers vs. cell volume of S cerevisiae using the data from Ref. [21]. Our model predicts a positive correlation between the mRNA production rates and nonlinear scaling degrees among genes, and experimental data confirm this. We further analyze the promoter sequences of all genes with measured nonlinear scaling degrees and find that special motifs for transcription factor binding are indeed enriched in the promoters of genes exhibiting nonlinear scaling, in concert with our theoretical predictions. Our results imply that the nonlinear scaling of gene expression level can be under evolutionary selection through the promoter sequences.
Results
Model of gene expression at the whole-cell level
We consider a coarse-grained model of gene expression. The mRNA production rate k of one particular gene labeled by index i is proportional to its gene copy number (g) and the probability for its promoters to be bound by RNAPs (P),whereHere Γ is the initiation rate of transcription of gene i: the rate that a promoter-bound RNAP starts transcribing the gene and producing mRNA. We assume the probability for one promoter to be bound by an RNAP follows the Michaelis-Menten equation where c is the concentration of free RNAPs in the nucleus[24] (see the schematic in Fig. 1). For simplicity, we consider all the RNAPs to be in the nucleus and neglect the small fractions of RNAP intermediates that may exist in the cytoplasm.
Fig. 1
A coarse-grained model of gene expression at the whole-cell level.
Inside the nucleus, genes compete for free RNAPs to bind to their promoters and for simplicity we only show two genes in this schematic. The binding probability of free RNAPs to the promoters depends on the free RNAP concentration and the recruitment abilities of the promoters. In the cytoplasm, mRNAs compete for free ribosomes to bind to their ribosome binding sites and the binding probability depends on the free ribosome concentrations and the recruitment abilities of the mRNAs to ribosomes. The cell volume V, which includes the cytoplasmic volume (V) and nuclear volume (V), is proportional to the total protein mass M. The ratio between the nuclear volume and cell volume is constant.
A coarse-grained model of gene expression at the whole-cell level.
Inside the nucleus, genes compete for free RNAPs to bind to their promoters and for simplicity we only show two genes in this schematic. The binding probability of free RNAPs to the promoters depends on the free RNAP concentration and the recruitment abilities of the promoters. In the cytoplasm, mRNAs compete for free ribosomes to bind to their ribosome binding sites and the binding probability depends on the free ribosome concentrations and the recruitment abilities of the mRNAs to ribosomes. The cell volume V, which includes the cytoplasmic volume (V) and nuclear volume (V), is proportional to the total protein mass M. The ratio between the nuclear volume and cell volume is constant.K is the Michaelis-Menten (MM) constant which is inversely proportional to the binding rate of RNAPs on the promoters (see detailed derivations in Supplementary Discussion A). In the following, we use 1/K as a metric of the recruitment abilities of genes to RNAPs: a larger K represents a smaller recruitment ability. We assume the mRNA lifetime as τ so that given the mRNA production rate, the mRNA copy number m changes according to Eq. (6) in Methods. Note that we mainly discuss eukaryotic cells in this work in which the transcription and translation processes are spatially separate, and nonspecific binding of RNAPs to DNA are mostly irrelevant[25,26]. As we also mainly discuss the transcription of mRNA, RNAP here refers to RNAP II. Real transcription processes in eukaryotic cells are complex and may involve transcription factors, mediators, enhancers, and TATA-binding proteins[27-29]. Their effects are coarse-grained into the Michaelis-Menten constant within our model, e.g., a transcription factor that increases the expression level of one gene is equivalent to reducing the Michaelis-Menten constant in the RNAP binding probability P.Because the typical time scale for an RNAP to finish transcribing a gene is around one minute[30], much shorter than the typical doubling times, we assume that the dynamics of RNAPs along each gene is in the steady-state. Therefore, the outgoing flux of RNAPs that finish transcribing a gene (vn/L) must be equal to the initiation rate of transcription (PΓ) where n is the number of transcribing RNAPs on one copy of gene i. Here L is the length of the gene i in the number of codons and v is the elongation speed of RNAP, which we approximate as a constant. This leads to the expression of n as n = PΛ where Λ = ΓL/v is the maximum possible number of transcribing RNAPs on one copy of gene i.A similar model as transcription also applies to the translational process (Fig. 1 and Methods). In the following analysis, we assume that most proteins are nondegradable, and our results regarding nonlinear scaling are equally valid for degradable proteins. The total protein mass of a cell is M = ∑pL, in the number of amino acids where p is the protein copy number of gene i. The total protein mass is known to be proportional to the cell volume[1,31-33]. Therefore, we assume a constant ratio between the total protein mass and cell volume. We further assume that the nuclear volume is proportional to the total cell volume, supported by experimental observations[34].
A simplified model in which all genes share the same recruitment ability
In the following, we consider a simplified scenario in which the promoters of all genes have the same recruitment ability to RNA polymerases so that K = K for all i. Within this scenario, we find that the mass fractions of proteins in the entire proteome are approximately constant as the cell volume increases given fixed gene copy numbers (Supplementary Discussion B). Therefore, the protein numbers of all genes, including RNAP, are proportional to the total protein mass, and therefore also proportional to the cell volume.We also assume that almost all RNAPs are bound to a promoter or transcribing, and we will discuss its validity later in this section. Therefore, the total mRNA production rate of all genes is proportional to the total number of RNAPs, which is also proportional to the cell volume. Finally, using the fact that all genes share the same recruitment ability to RNAPs, it follows that the total mRNA production rate should be evenly distributed among all genes. Therefore, each gene’s mRNA production rate is also proportional to cell volume, which is the main result of this section.Here, we also derive the mathematical expression of mRNA production rates. As we show in Methods, the fraction of RNAPs that are bound to a promoter or transcribing is very close to 1 if n < n where n = ∑g(1 + Λ), which is the maximum number of RNAPs the entire genome can hold. In other words, the fraction of free RNAPs, which are neither bound to a promoter nor transcribing, is much smaller than 1 if n < n. Given this, we obtain the binding probability as P = n/n (Methods). Using Eq. (1), the mRNA production rate of gene i becomeswhich is proportional to the cell volume.The above scenario of gene expression, which has been called Phase 1 of gene expression[13], is the typical state of cells in normal conditions[6,14,35] and the main focus of this work. In Phase 1, the cell volume grows exponentially, and the homeostasis of mRNA and protein concentrations of most genes are maintained[13] (see detailed discussions in Supplementary Discussion B). When the total number of RNAPs exceeds the threshold value n, the linear scaling between mRNA numbers and cell volume breakdowns and cells enter a different phase of gene expression (see detailed discussions in Methods and Supplementary Discussion B).Finally, since the mRNA lifetimes are typically much shorter than the doubling time[30], mRNA productions are in quick equilibrium. Therefore according to Eq. (6) in Methods, the mRNA copy numbers can be approximated as the products of mRNA production rates and mRNA lifetimes, therefore proportional to the cell volume as well.
A more realistic model in which genes can have different recruitment abilities
We now consider a more realistic scenario in which the recruitment abilities to RNAPs of different genes can be different. We start from a simple scenario in which all genes have the same recruitment ability 1/K except one special gene i has a recruitment ability 1/K. We note that the only parameter affecting mRNA production rates as the volume changes is the concentration of free RNAPs, c, which enters the binding probability P (Eq. (2)). Since the contribution of the particular gene to the global allocation of RNAPs is negligible, the proportionality between the mRNA production rates of most genes and the cell volume is still valid. It then follows that c must change in a way that ensures that the binding probability of RNAP P is proportional to cell volume for all genes except the particular gene with a different MM constant. Therefore, if the particular gene has a lower MM constant K than the typical K, this gene is saturated earlier by the rising free RNAP concentration—it thus increases more slowly with increasing volume. A gene with a higher K, on the other hand, is not so easily saturated as the typical gene and hence increases superlinearly with volume.More quantitatively, given the probability for the promoters of most genes to be bound is still P ≈ n/n, the free RNAP concentration can be expressed as a function of n/n. Using the expression of free RNAP concentration, we obtain the mRNA production rate for the particular gene i with the MM constant equal to K:As n gets closer to n (with F ≪ 1 still satisfied), we find that if K > K (K < K) the mRNA production rate of gene i exhibits a superlinear (sublinear) dependence on the RNAP number. Because the RNAP number is proportional to cell volume, this nonlinear relation is also valid between the mRNA production rates and cell volume, which leads to the nonlinear scaling between the mRNA copy numbers and cell volume. Furthermore, if the corresponding proteins have short lifetimes, their copy numbers will be proportional to the mRNA production rates as well. More importantly, we find that the nonlinear scaling in mRNA copy numbers also propagates to nondegradable proteins, suggesting that the nonlinear scalings in protein copy numbers are insensitive to their lifetimes (Methods).In the following, we discuss a more realistic scenario that is a continuous distribution of K to reflect the promoter structures’ heterogeneity. In this case, we propose that the nonlinear scaling, Eq. (4), is still approximately valid for any gene if K is replaced by 〈K〉, the average value of K among all genes with some appropriate weights. In Supplementary Discussion D, we show that the appropriate weight can be well approximated by the protein mass fractions, as we confirm numerically in the next section. Therefore, those genes with K larger (smaller) than 〈K〉 exhibit superlinear (sublinear) scaling, while genes with K near the average exhibit approximately linear scaling.We remark that to ensure the linear scaling of the majority of genes, the RNAP number and the ribosome number should be proportional to the cell volume, which requires that the MM constants of RNAP and ribosome are close to the average value. These are the additional assumptions we make in the case of continuously distributed K. For RNAP, it is supported by the constant mRNA concentrations of RNAP related genes observed in the experimental data from Ref. [21] (Supplementary Fig. 10). For ribosome, we found a small deviation of ribosomal mRNA number from linear scaling (Supplementary Fig. 9). However, as we show later, our theoretical predictions on the nonlinear scaling of gene expression level still work satisfyingly well in the presence of a small deviation of the ribosome from linear scaling.To quantify the nonlinear degree between the expression level and the cell volume of each gene, we introduce a parameter β for mRNAs (and also α for nondegradable proteins, see detailed derivations and discussions in Methods), so that its scaling with cell volume becomesHere the mRNA are measured relative to their values at time t = 0, e.g., . A similar formula hold for proteins (Eq. (19) in Methods). Positive (negative) β represent sublinear (superlinear) scaling behaviors. They are related to the recruitment abilities of their corresponding genes to RNAPs (see the full expressions in Methods) and when K is close to the average Michaelis-Menten constant 〈K〉, we find that β ∝ 〈K〉 − K.
Numerical simulations
We numerically tested our theoretical predictions by simulating a cell with 2000 genes, including RNAP and ribosome, which we coarse-grained as single proteins. Simulation details are summarized in Methods, and parameter values are shown in Supplementary Table 1. To avoid the effects of the cell cycle, which is certainly important but also complicates our analysis, we mainly considered the scenario in which the cell volume grows without cell-cycle progress e.g., cells arrested in the G1 phase, which is a common experimental protocol to study the effects of cell volume on gene expression[21,35]. We also simulated the case of the periodic cell cycle as we discuss later in the Discussion section.We first simulated the simplified model with homogeneous recruitment abilities to RNAPs. We confirmed the exponential growth of cell volume and the linear scaling between mRNA copy numbers, protein copy numbers, and cell volume (Supplementary Fig. 2). We then simulated the case when all genes share the same recruitment ability K, but two genes have different abilities that are respectively smaller and larger than K. Our theoretical predictions for the expression levels of mRNAs, nondegradable proteins (Fig. 2a, b), and degradable proteins (Supplementary Fig. 3), which assume short mRNA lifetimes, match the simulation results reasonably well. We note that the lifetimes of degradable proteins can be comparable to the duration of the cell cycle, e.g., half of the cell-cycle duration[36], in which case deviations of numerical simulations from our theoretical prediction are expected (Supplementary Fig. 3). The nature of nonlinear scaling, whether superlinear or sublinear, is nevertheless independent of the protein’s lifetime, as we show in Methods.
Fig. 2
We simulate the model with two special genes and their Michaelis-Menten constants are respectively K = 20K, K = 0.2K.
All the other genes have K = K. (a) The mRNA levels of the two special genes show superlinear and sublinear scalings with cell volume, in agreement with the theoretical predictions (dashed lines). The mRNA numbers and cell volume are normalized by their initial values. b Same analysis as (a) for nondegradable proteins. Deviations are expected since the actual mRNA lifetimes are finite.
We simulate the model with two special genes and their Michaelis-Menten constants are respectively K = 20K, K = 0.2K.
All the other genes have K = K. (a) The mRNA levels of the two special genes show superlinear and sublinear scalings with cell volume, in agreement with the theoretical predictions (dashed lines). The mRNA numbers and cell volume are normalized by their initial values. b Same analysis as (a) for nondegradable proteins. Deviations are expected since the actual mRNA lifetimes are finite.Finally, we simulated the more realistic scenario in which K is continuously distributed, which we modeled as a lognormal distribution. Our conclusions are independent of the specific choice of the form of distribution. Examples of volume dependence of several mRNAs and proteins are shown in Supplementary Fig. 4. Nonlinear degrees of mRNAs and proteins are measured based on Eqs. (5, 19).The resulting distribution of mRNAs is shown in Fig. 3a, and those of proteins are shown in Supplementary Fig. 4. To test our theoretical predictions on the relations between the nonlinear degrees and K (Eqs. (17, 20) in Methods), we also need the expression of the appropriate average 〈K〉. We found that the appropriate average Michaelis-Menten constant is the one weighted by the initial protein mass fractions, 〈K〉 = ∑ϕK, where ϕ is the mass fraction of proteins i in the entire proteome at time t = 0. The above expression of 〈K〉 leads to a good agreement between numerical simulations and theoretical predictions (Fig. 3b). We also used an alternative weight that is the time-averaged protein mass fractions, which works equally well (Supplementary Fig. 5). We set the Michaelis-Menten constants of RNAP and ribosome as the average value to ensure that their concentrations are approximately constant as the cell volume increases.
Fig. 3
Simulation of the scenario in which K is continuously distributed.
a Distribution of the measured nonlinear degrees β of mRNA numbers from numerical simulations. The dashed line marks the location of the median value of the nonlinear degrees. b We compare the theoretically predicted nonlinear degrees of mRNA numbers and the measured one from numerical simulations. In (a) and (b), the coefficient of variation of K is 0.5. c, d The same analysis as (a, b) with the coefficient of variation of K equal to 1.
Simulation of the scenario in which K is continuously distributed.
a Distribution of the measured nonlinear degrees β of mRNA numbers from numerical simulations. The dashed line marks the location of the median value of the nonlinear degrees. b We compare the theoretically predicted nonlinear degrees of mRNA numbers and the measured one from numerical simulations. In (a) and (b), the coefficient of variation of K is 0.5. c, d The same analysis as (a, b) with the coefficient of variation of K equal to 1.In Fig. 3a, b, we set the coefficient of variation (standard deviation/mean) of the MM constants as 0.5. To confirm the validity of the model since the recruitment abilities of different promoters can be widely different[27,28,37,38], we also simulated a larger coefficient of variation equal to 1 (Fig. 3c, d). The resulting nonlinear degrees exhibit a broader distribution and appear more similar to experiments as we show in the next section. Furthermore, the theoretical predictions of the nonlinear degrees of mRNA numbers still match reasonably well with the simulations.We note that the recruitment ability not only determines the nonlinear degree of volume scaling but also affects the mRNA production rate since a higher recruitment ability enhances the binding probability of RNAPs to the promoter. This suggests that there should be a positive correlation between the mRNA production rate and the nonlinear degree. Meanwhile, we note that the recruitment ability also depends on the transcription initiation rate Γ (Supplementary Discussion A, Eq. (S2)): a higher initiation rate reduces the recruitment ability (increases the MM constant). For simplicity, in most of our simulations, we consider a constant Γ for genes (except for ribosome and RNAP), and in this case, we indeed found a strong positive correlation between the mRNA production rates and the nonlinear degrees β (Supplementary Fig. 4e). However, in a more general model with heterogeneity in Γ, a higher initiation rate increases the mRNA production rate but also reduces the recruitment ability so that decreases the nonlinear degree. Therefore, heterogeneity in the initiation rates reduces the correlation between the mRNA production rates and the nonlinear degrees. To confirm this prediction, we simulated the case of heterogeneous initiation rates (see numerical details in Supplementary Discussion E), and our predictions are confirmed numerically (Supplementary Fig. 6). We note that this may be a plausible mechanism of the weak but positive correlation observed in the experimental data, as we discuss in the next section.Our results suggest that those genes with sublinear scaling and smaller K contribute more in the weighted average of K, therefore where is the average over all genes with equal weights. Therefore, genes with are expected to exhibit superlinear scaling. However, to have an estimation of the nonlinear degree of most genes, the appropriate MM constant to compare with 〈K〉 is the median value. For the lognormal distribution we used in simulations, we found that the median value of K is close and slightly larger than 〈K〉. Therefore, the nonlinear degree of the median K is slightly negative compared with the entire distribution (Fig. 3a, c), which is consistent with the experimental observations (Fig. 4a).
Fig. 4
Analysis of experimental data.
a Distribution of the nonlinear scaling degrees of mRNAs of S. cerevisiae among genes. The dashed line marks the location of the median value of the nonlinear degrees. We consider genes with − 1 < β < 3.2, including 95% of all the measured genes. b The Pearson correlation coefficient between the nonlinear degrees of mRNAs and the mRNA production rates is 0.17 (two-sided Pearson correlation test, p value < 2.20e-16). The red data points are median values after binning. For the same data, the Spearman correlation coefficient is 0.35 (Spearman correlation test, p value < 2.20e-16). c Genes annotated as negative regulation of cell cycle are enriched in the sublinear regime. In the bottom panel, genes are ordered by the nonlinear degree β from positive (sublinear) to negative (superlinear). In the middle panel, the vertical lines represent the locations of the cell-cycle inhibitors. The upper panel shows the running enrichment score (ES) for the gene set, where the score at the peak is the ES for this gene set. The top-right legend includes the p value and the FDR q value of GSEA. d An example of a motif that is enriched in the sublinear regime. Here, the vertical lines in the middle panel represent the locations of genes containing the particular motif in the promoter sequences. Note that the motif also appears in the weakly superlinear regime but diminishes in the strongly superlinear regime. e We pick out all 77 motifs enriched in the promoters of sublinear genes and calculate the average β over genes with at least n motifs. f The functions of transcription factors associated with the 77 motifs enriched in the sublinear regime. Positive regulation of transcription by RNA polymerase II (GO:0045944) is enriched with p value = 2.41e-32. The 76% positive regulation is not likely generated from random sampling (single-sided hypergeometric test, p value = 5.29e-4).
Analysis of experimental data.
a Distribution of the nonlinear scaling degrees of mRNAs of S. cerevisiae among genes. The dashed line marks the location of the median value of the nonlinear degrees. We consider genes with − 1 < β < 3.2, including 95% of all the measured genes. b The Pearson correlation coefficient between the nonlinear degrees of mRNAs and the mRNA production rates is 0.17 (two-sided Pearson correlation test, p value < 2.20e-16). The red data points are median values after binning. For the same data, the Spearman correlation coefficient is 0.35 (Spearman correlation test, p value < 2.20e-16). c Genes annotated as negative regulation of cell cycle are enriched in the sublinear regime. In the bottom panel, genes are ordered by the nonlinear degree β from positive (sublinear) to negative (superlinear). In the middle panel, the vertical lines represent the locations of the cell-cycle inhibitors. The upper panel shows the running enrichment score (ES) for the gene set, where the score at the peak is the ES for this gene set. The top-right legend includes the p value and the FDR q value of GSEA. d An example of a motif that is enriched in the sublinear regime. Here, the vertical lines in the middle panel represent the locations of genes containing the particular motif in the promoter sequences. Note that the motif also appears in the weakly superlinear regime but diminishes in the strongly superlinear regime. e We pick out all 77 motifs enriched in the promoters of sublinear genes and calculate the average β over genes with at least n motifs. f The functions of transcription factors associated with the 77 motifs enriched in the sublinear regime. Positive regulation of transcription by RNA polymerase II (GO:0045944) is enriched with p value = 2.41e-32. The 76% positive regulation is not likely generated from random sampling (single-sided hypergeometric test, p value = 5.29e-4).
Analysis of experimental data and searching for motifs in the promoter sequences
We analyzed the genome-wide dataset from Ref. [21] where the volume dependences of mRNA levels are measured for S. cerevisiae. We calculated the nonlinear degrees of mRNA scaling with cell volume using Eq. (5) and obtained the resulting distribution (Fig. 4a). The calculated nonlinear degree β is highly correlated with the nonlinear scale calculated in Ref. [21] (Supplementary Fig. 7). The median value of nonlinear degrees is close to zero, suggesting that the majority of genes show approximately linear scaling, similar to our numerical simulations. We found that by choosing appropriate parameters, the numerically simulated distribution of nonlinear degrees matches the experimental measured distribution reasonably well (Supplementary Fig. 8).We also calculated the correlation between the mRNA production rates and the nonlinear degrees β. We used the mRNA amount at the smallest cell volume divided by the mRNA lifetime as a proxy for the mRNA production rate according to Eq. (6) (Methods). A weak but significant correlation is indeed observed, as shown in Fig. 4b. As we discussed in the section of numerical simulations, the heterogeneous initiation rates can reduce the correlation between the mRNA production rates and the nonlinear degrees. To further verify our model, we also simulated a modified model in which the nonlinear scaling is independent of the Michaelis-Menten constants (Supplementary Discussion E) and found that the correlation between the mRNA production rates and the nonlinear degrees becomes negative (Supplementary Fig. 6).We used Gene Set Enrichment Analysis (GSEA)[39,40] to find annotated functional gene sets that are enriched in the superlinear and sublinear scaling regime (Methods, Supplementary Fig. 9). Interestingly, we found that the ribosomal genes and other translation-related genes, which correspond to the coarse-grained ribosomal genes in our model, are enriched in the superlinear regime with the average β over ribosomal genes about − 0.2. Similar observations were reported in Ref. [21]. However, the superlinear scaling of ribosomal genes was also observed in the control cases in which the nonlinearities of other known nonlinear scaling genes were suppressed. Therefore, it was argued that the superlinear scaling of ribosomal genes may be an artifact due to the drug that blocks cell-cycle progress. Interestingly, recent experiments of mammalian cells found weak sublinear scaling of ribosomal proteins[23]. For RNAP related genes, we found that they indeed show linear scaling with cell volume as we assume in our coarse-grained model, which is crucial for the linear scaling between the mRNA copy numbers and cell volume (Supplementary Fig. 10). To confirm the validity of our conclusions in the presence of weakly superlinear scaling of ribosomes, we numerically simulated our gene expression model with the recruitment ability of ribosomal gene weaker than the average value and found that even with the small deviation of ribosome number from linear scaling, our theoretical predictions still agree well with the numerical simulations (Supplementary Fig. 11).In Ref. [21], the expression of pre-selected activators for the cell cycle were shown to be superlinear, while pre-selected inhibitors were shown to be sublinear. So we next checked the nonlinear degrees of all cell-cycle regulators using GSEA. We found that inhibitors are indeed enriched in the sublinear regime (Fig. 4c), but activators were not enriched in the superlinear regime. We remark that the inconsistency may be due to the preselection of regulators, but the conclusion of Ref. [21] that the interplay of inhibitors and activators can trigger cell-cycle progress is still valid as long as they have different scaling behaviors.To further support our theoretical predictions, we investigated the promoter sequences of all genes included in our analysis. We expect that those genes with nonlinear scaling should have some special patterns in their promoter sequences, which render them stronger or weaker recruitment abilities to RNAPs than the others. If this is the case, specific motifs should be enriched in the superlinear or sublinear regime. We detect the transcription factors binding motifs in the promoter sequences and then used GSEA to identify those motifs that are enriched in the nonlinear regime. We found 77 motifs enriched in the sublinear regime (see one example in Fig. 4d and Supplementary Table 2). To further validate our results, we computed the average β for genes containing at least n motifs that are enriched in the sublinear regime and found that the average β indeed increases as a function of n (Fig. 4e). Consistent with our theoretical predictions, 76% of the 49 corresponding transcription factors exhibit positive regulation and therefore enhance the recruitment abilities to RNAPs of their target genes (Fig. 4f). However, we did not find motifs enriched in the superlinear regime. Considering the cumulative effect of motifs on β as shown in Fig. 4e, we propose that antagonistic effect may also exist among motifs, which suggests that motifs reducing the recruitment ability to RNAPs may reside in most genes. But in genes without superlinear scaling, their effects are counteracted by other motifs that enhance the recruitment ability.
Discussion
For abnormally large cells, the number of RNA polymerases or ribosomes may exceed some threshold values so that the bottlenecks of gene expression become the templates of gene expression: gene copy numbers and mRNA copy numbers[14,35]. However, in typical cellular physiological states, cells are far below the thresholds, and the limiting bottlenecks of gene expression are the copy numbers of RNAPs and ribosomes[10,15,16]. In this case, if the promoters of all genes share the same recruitment ability, the expression levels of all genes should exhibit linear scaling with cell volume both at the mRNA and protein levels[13]. We extended this simple scenario to a more realistic case in which the recruitment abilities among genes are continuously distributed. We derived the dependence of the mRNA production rate on cell volume. We show that genes with recruitment abilities below (above) the average exhibit superlinear (sublinear) scaling with cell volume, a natural consequence of the heterogeneous distribution of recruitment abilities. We further show that the nonlinear scaling between the mRNA production rates and cell volume propagates to the mRNA copy numbers and proteins copy numbers. All of our theoretical predictions were confirmed by numerical simulations.Nonlinear scaling of protein levels is crucial to cell-cycle regulation[21]. Time-dependent protein concentrations allow cells to determine the timing of various cell-cycle events, e.g., based on the ratio of two proteins with different scaling behaviors. To confirm this scenario, we also simulated the case of the periodic cell cycle and let the cell divide when the ratio of the concentrations of one superlinear protein and one sublinear protein exceeds some threshold value. We found that periodic patterns of mRNA and protein concentration emerge. For superlinear genes, their mRNA and protein concentrations decrease initially at the beginning of the cell cycle due to the halved RNAP number at cell birth, but quickly increases as the RNAP number increases (vice versa for sublinear genes). As the cell gets the periodic steady-state, all mRNAs and proteins double their numbers at cell division compared with cell birth[13,41] (Supplementary Discussion F and Supplementary Fig. 12).Our model shares some similarities with the model introduced in the Methods section of Ref. [13], but also with key differences. This work focuses on the effects of heterogeneous MM constants and the resulting nonlinear scaling of gene expression levels, including both the mRNA and protein. In contrast, the model in the Methods of Ref. [13] only consider transcription process and homogeneous MM constants. Furthermore, the previous model mainly considers the effects of nonspecifically bound RNAPs, which is believed to be important in bacterial gene expression[24]. The model in this work mainly considers eukaryotic cells, and the experimental data we analyze is from S cerevisiae[21].The recruitment abilities not only determine the nonlinear degrees of gene expression but also determine the mRNA production rates. Therefore, our model predicts that genes with higher (lower) mRNA production rates are more likely to exhibit sublinear (superlinear) scaling with cell volumes. We also note that heterogeneity in the transcription initiation rates can reduce the positive correlation between the mRNA production rates and the nonlinear scaling degrees, in concert with the small but positive correlation observed in the experimental data of S. cerevisiae[21]. Furthermore, according to our theoretical models, motifs that enhance or reduce the promoters’ recruitment abilities should exist in the superlinear or sublinear scaling genes. Indeed, we found a group of motifs enriched in the sublinear regime, and these motifs are associated with transcription factor (TF) binding sites that have positive regulation on the target genes. We note that other mechanisms of nonlinear scaling of gene expression levels are possible, such as time-dependent transcription factor concentrations[21], or time-dependent initiation rates. A time-dependent transcription factor concentration is equivalent to a time-dependent MM constant K within our model. However, we note that our GSEA analysis showed that TF-related terms were not enriched in the nonlinear regime (Supplementary Fig. 9), which means TFs do not change their concentrations in general. Therefore, we argue that a changing TF concentration is more specific to certain genes instead of a general situation. Also, we remark that our model does not require time-dependent variables to achieve changing concentrations, and the changing concentrations of mRNAs and proteins are the result of the competition between the genes to the limiting resource of RNAPs. We note that in Ref. [21], the numbers of binding sites for specific cell-cycle transcription factors in the target genes were found to be positively correlated with the superlinear degrees of their mRNA levels, which may be due to other mechanisms not related to our model.Our results can have far-reaching implications: the nonlinear scaling of gene expression level allows cells to sense their sizes based on the ratio of concentrations of different proteins, enabling cells to decide the timing of multiple cell-cycle events such as cell division. Our results suggest that sensing the concentration differences among a group of proteins as a measure of cell volume can be the most accessible option cells can take to achieve cell size regulation. The promoter sequences can also be under evolutionary selection to achieve desired nonlinear scalings of particular genes and robust cell-cycle regulation. Finally, the gene expression model proposed in this work is by construction at the whole-cell level. Therefore, it can be a valuable platform for mathematical modeling of gene expression, especially for problems in which the competition among genes for the limiting resources of RNAPs and ribosomes are crucial.
Methods
A summary of the variables used in the main text
Details of the gene expression model
We explain more details of the gene expression model in this section. Given the mRNA production rate, the mRNA copy number m changes aswhere τ is the mRNA lifetime.Regarding protein production, the protein production rate of one particular gene is proportional to its corresponding mRNA number (m), the translation initiation rate (Γ), and the probability for the ribosome binding site of mRNA to be bound by ribosome:Here c is the concentration of free ribosomes in the cytoplasm and K is the Michaelis-Menten constant of ribosome binding on the mRNAs (see an alternative formulation of translation model in Ref. [42]).Because the total number of RNAPs can be separated to free RNAPs, initiating RNAPs, and transcribing RNAPs, we obtained a self-consistent equation to determine the fraction of free RNAPs in all RNAPs:Here n is the total number of RNAPs, n = ∑g(1 + Λ), and F is the fraction of free RNAPs. c is the concentration of total RNAPs in the nucleus and here we consider the simplified model in which all genes have the same MM constant K to RNAPs. The left side represents the number of RNAPs bound to promoters or transcribing. The right side represents the difference between the total number of RNAPs and free RNAPs, which should be equal to the left side. Meanwhile, we assume that c ≫ K, namely, the total RNAP concentration is much larger than the MM constant of a typical promoter, supported by observations in bacteria[43]. We argue that this assumption is biologically reasonable because if all RNAPs suddenly become free so that c = c, one would expect that these free RNAPs will have a strong tendency to rebind to the promoters; otherwise, a large fraction of RNAPs will be idle, which is clearly inefficient in normal cellular physiological states. We remark that although the assumption appears reasonable, it remains to be tested in yeast. Using the assumption that K/c ≪ 1, we find that the fraction of free RNAPs F that solves Eq. (8) must be much smaller than 1 if n < n (see the illustration in Supplementary Fig. 13). Therefore, we can take F = 0 in Eq. (8) and obtain the binding probability as P = n/n.When the number of RNAPs exceeds the threshold value n, the linear scaling between the mRNA numbers and cell volume breakdowns for all genes, and the growth mode of cell volume also deviates from exponential growth (Phase 2). If the ribosome number exceeds some threshold value, the cell volume eventually grows linearly, which has been observed in budding yeast[35] (Phase 3, see Ref. [13] and detailed derivations in the Supplementary Discussion B). We note that the main purpose of the assumption c ≫ K is to make the condition of the negligible fraction of free RNAPs more well defined as n/n < 1. Our conclusions on the relation between the nonlinear scaling and the recruitment abilities do not rely on this assumption. Since we mainly focus on the scenario in which RNAP is limiting with F ≪ 1, the transition details from Phase 1 to Phase 2 is not important to our conclusions. We also discuss the effects of nonspecific binding of RNAPs on the transition between Phase 1 and Phase 2 in Supplementary Discussion C and show that the condition of Phase 1 becomes more stringent in the presence of nonspecific binding.
Derivation of the nonlinear scaling
We consider a simple model assuming all genes have the same MM constant K except one special gene i has a MM constant K. Since the contribution of the particular gene to the global allocation of RNAP is negligible, Eq. (8) is still valid. We focus on Phase 1 so that F ≪ 1 and express cF as a function of n/n. The mRNA production rate for the particular gene with the MM constant equal to K therefore becomes Eq. (4).If K > K, we find that the mRNA production rate of gene i exhibits a superlinear dependence on the RNAP number, therefore also the cell volume. If K < K, the mRNA production rate of gene i exhibits a sublinear behavior. The general solution for Eq. (6) becomesIn the limit that the lifetime of mNRA goes to zero, the number of mRNA becomes strictly proportional to the mRNA production rateWe now consider the dynamics of protein number which isThe general solution for Eq. (11) isIn the following, we assume cF to be constant, which is a good approximation in Phase 1 (Supplementary Discussion B) and consider two limiting cases, τ → 0 and τ → ∞.When τ → 0, the protein number becomes strictly proportional to the protein production rate, which is proportional to the mRNA number,Note that the second identity is valid when τ → 0 which is a good approximation when the lifetime of mRNA is much shorter than the doubling time.When τ → ∞ and τ → 0 the dynamics of protein number becomesIn Phase 1, the number of RNA polymerase increases exponentially n(t) = n(0)e, therefore the integral in Eq. (14) can be analytically calculatedThe nonlinear scaling of mRNA copy number also propagates to nondegradable proteins.To quantify the nonlinear degrees of mRNA copy number, we investigate the volume dependence of mRNA copy number and normalize the volume and mRNA number by their initial values. Using Eqs. (4), (10)), we obtainHere , , ΔK = K − K. We have used the fact that the RNAP concentration is constant therefore n(t)/n(0) = V(t)/V(0). Note that when K is continuously distributed, we should replace K by 〈K〉. Comparing with Eq. (5) in the main text, we find thatWhen ΔK = 0, β = 0 as expected. For proteins with short lifetimes, the above analysis is equally valid.We also study the nonlinear degree of nondegradable proteins, and using Eq. (15), we find thatHere we combine all the constants divided by p(0) before the logarithmic term in Eq. (15) to C and . Therefore, we can write Eq. (18) aswhereIn the limit ΔK → 0, we find thatwhich is consistent with Eq. (14) in the case of K = K.
Details of numerical simulations
All simulations were done in MATLAB (version R2020b and R2021a). We summarize some of the parameters we used in the numerical simulations in Supplementary Table 1. The gene copy numbers are time-independent which we set as 1 for all genes except the ribosome gene which we set as 5. Given an attempted growth rate μ0, we set the attempted mass fraction of ribosomes in the entire proteome asHere, L is the length of the ribosome gene in the unit of codons. Note that the actual mass fraction is slightly time-dependent and deviates from the attempted value. To get Eq. (22), we assume that all ribosomes are actively translating and neglect the correction due to free ribosomes and initiating ribosomes. Given the mass fraction of ribosomes, we assume that the copy number of RNAPs is about 10% of that of ribosomes and set the attempted mass fraction of RNAP asHere L is the length of the RNAP gene. In Fig. 2, we set K = 6 × 103/μm3, and the two special genes with K = 20K and K = 0.2K. We also set the lifetimes of all mRNAs as 10 mins. In Fig. 3, we set K following a lognormal distribution so that it’s average . We also set the lifetimes of mRNA following a lognormal distribution with a mean equal to 10 min and a coefficient of variation equal to 1. In all simulations, we set the initial attempted total protein mass as M = 109 in the unit of amino acid number and the attempted critical RNAP number as n = ∑g(1 + Λ) = 104.To find the appropriate value of Γ that leads to the attempted n, we assume cF ≪ K so thatfrom which we can find that Γ = yΓ where Γ and Γ are respectively the transcription initiation rates of ribosomes and RNAPs. y can be found using the above equation. We also set Γ = xΓ for all i except the genes for RNAP (i = 1) and ribosome (i = 2), so thatfrom which we can find the expression of x. Finally, using , we find thatfrom which we find Γ and Γ for i > 2. Note that the approximation cF ≪ K for all i is merely to find the values of Γ, which do not affect our conclusions.Before we run the simulations, we use the attempted ϕ, initial cell mass, and Γ with g and K to calculate the mRNA production rates for all genes using Eqs. (1, 2), from which we use m = kτ as the initial condition. Using the initial m, we compute the initial protein mass fractions as ϕ = mL/∑mL and also update the mass fractions of RNAP and ribosome. The actual initial mass fraction of RNAP can slightly deviate from the attempted value. To make sure the RNAP copy number is continuous from the beginning of simulations, we also slightly shift the total protein mass from the attempted value so that ϕ(actual)M(actual) = ϕ(attemped)M(attempted).To remove the transient effects, we take the simulation results at t = 20 as the initial values when we compare our results with theoretical predictions. To simulate degradable proteins, we choose 200 of the genes as degradable proteins with lifetimes τ = 10 mins and for degradable proteins, we take the simulation results at t = 100 as the initial values. The simulation is stopped when the total protein mass is larger than 9M.
Details of the mRNA production rates
For the mRNA production rate, we used the mRNA amount (the product of RPKM and cell volume) at the smallest cell volume divided by the mRNA lifetime from Ref. [44] as a proxy according to Eq. (6) in Methods.
Details of motif searching in the promoters of genes
Promoters are defined as 500 bp upstream of transcription start sites in the main text. Sequences of promoters for every gene were downloaded from the Yeastract database[45]. Then we used the online tool CentriMo[46] of MEME suite (version 5.3.2)[47] to detect the transcription factors binding motifs annotated in the Yeastract database in these sequences with default parameters.
Details of GSEA
We performed enrichment analysis using GSEA using package clusterProfiler (version 3.12.0)[48] in R (version 3.6.1). Genes were ordered by nonlinear degree β. The cut-off criteria were set as both the p value and the false discovery rate (FDR) q value < 0.05. The number of permutations used in the analysis is 1e5.In the analysis of functions, the gmt file was generated from KEGG BRITE hierarchy files containing KEGG objects (KO) for budding yeast in Kyoto Encyclopedia of Genes and Genomes (KEGG) database[49-51]. In the analysis of cell-cycle regulators, we generated a gmt file containing only two terms, negative regulation of cell cycle (GO:0045786) and positive regulation of cell cycle (GO:0045787), obtained from Gene Ontology (GO) database with AmiGO (version 2.5.15)[52-54].In the analysis of motifs, the gmt file was generated based on the CentriMo results. Every motif was considered as a gene set, including multiple genes that contain this motif in their promoters.
Details of functional enrichment analysis of TFs
Transcription factors corresponding to the motifs enriched in GSEA were picked out. Functional enrichment analysis for budding yeast was done using Metascape (version 3.5)[55] with default parameters. In 641 genes whose function are regulation of transcription, DNA templated (GO:0006355), 303 genes are annotated as positive regulation of transcription by RNA polymerase II (GO:0045944). In our analysis, all the 49 TFs corresponding to the motifs enriched in GSEA (except one gene ABF2) are annotated as GO:0006355 and 34 of them are annotated as GO:0045944. Based on this situation, single-sided hypergeometric test was performed using R function phyper.
Reporting Summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Variables
Meaning
V
cell volume
M
totle protein mass
Vc
cytoplasmic volume
Vn
nuclear volume
ρ
protein mass per cell volume
a
ratio between cell volume and nuclear volume
kn,i
mRNA production rate of gene i
Γn,i
transcriptional initiation rate of gene i
gi
gene copy number of gene i
Pb,i
RNAPs binding probability on the promoter of gene i
cn,free
free RNAPs concentration
Fn
free RNAPs fraction
Λn,i
maximum number of RNAPs one copy of gene i can hold
nc
maximum number of RNAPs the entire genome can hold
Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock Journal: Nat Genet Date: 2000-05 Impact factor: 38.330
Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205
Authors: Vamsi K Mootha; Cecilia M Lindgren; Karl-Fredrik Eriksson; Aravind Subramanian; Smita Sihag; Joseph Lehar; Pere Puigserver; Emma Carlsson; Martin Ridderstråle; Esa Laurila; Nicholas Houstis; Mark J Daly; Nick Patterson; Jill P Mesirov; Todd R Golub; Pablo Tamayo; Bruce Spiegelman; Eric S Lander; Joel N Hirschhorn; David Altshuler; Leif C Groop Journal: Nat Genet Date: 2003-07 Impact factor: 38.330
Authors: Athanasios Litsios; Daphne H E W Huberts; Hanna M Terpstra; Paolo Guerra; Alexander Schmidt; Katarzyna Buczak; Alexandros Papagiannakis; Mattia Rovetta; Johan Hekelaar; Georg Hubmann; Marten Exterkate; Andreas Milias-Argeitis; Matthias Heinemann Journal: Nat Cell Biol Date: 2019-11-04 Impact factor: 28.824