Synthetic biology has successfully advanced our ability to design and implement complex, time-varying genetic circuits to control the expression of recombinant proteins. However, these circuits typically require the production of regulatory genes whose only purpose is to coordinate expression of other genes. When designing very small genetic constructs, such as viral genomes, we may want to avoid introducing such auxiliary gene products while nevertheless encoding complex expression dynamics. To this end, here we demonstrate that varying only the placement and strengths of promoters, terminators, and RNase cleavage sites in a computational model of a bacteriophage genome is sufficient to achieve solutions to a variety of basic gene expression patterns. We discover these genetic solutions by computationally evolving genomes to reproduce desired gene expression time-course data. Our approach shows that non-trivial patterns can be evolved, including patterns where the relative ordering of genes by abundance changes over time. We find that some patterns are easier to evolve than others, and comparable expression patterns can be achieved via different genetic architectures. Our work opens up a novel avenue to genome engineering via fine-tuning the balance of gene expression and gene degradation rates.
Synthetic biology has successfully advanced our ability to design and implement complex, time-varying genetic circuits to control the expression of recombinant proteins. However, these circuits typically require the production of regulatory genes whose only purpose is to coordinate expression of other genes. When designing very small genetic constructs, such as viral genomes, we may want to avoid introducing such auxiliary gene products while nevertheless encoding complex expression dynamics. To this end, here we demonstrate that varying only the placement and strengths of promoters, terminators, and RNase cleavage sites in a computational model of a bacteriophage genome is sufficient to achieve solutions to a variety of basic gene expression patterns. We discover these genetic solutions by computationally evolving genomes to reproduce desired gene expression time-course data. Our approach shows that non-trivial patterns can be evolved, including patterns where the relative ordering of genes by abundance changes over time. We find that some patterns are easier to evolve than others, and comparable expression patterns can be achieved via different genetic architectures. Our work opens up a novel avenue to genome engineering via fine-tuning the balance of gene expression and gene degradation rates.
All genomes encode a variety of distinct RNA and protein sequences, and the production of these biomolecules is essential for the growth and survival of organisms. Critically, these various gene products are often required in different amounts relative to one-another and this demand further varies over time [1-3]. Thus, organisms must have the ability to modulate gene expression levels to meet lifestyle needs and adapt to changing environmental conditions [4-6]. Many gene-regulatory elements are well characterized and cells have been directly engineered to produce individual protein products for decades; the strength of promoters, terminators, ribosome binding sites, etc. can all vary over several orders of magnitude to produce different amounts of recombinant gene products [7-9].More recently, the field of synthetic biology has developed and achieved success in engineering large, time-varying genetic circuits that are characterized by complex interactions and often require producing gene products whose sole job is to regulate the expression of other genes [10-15]. While the scale and capabilities of these applications are continually expanding [16-18], the goal of designing entire genomes from scratch presents tremendous challenges. One particular source of challenges is genome complexity—even the smallest free living organisms encode hundreds of gene products that interact with one-another and alter gene expression patterns in ways that are difficult to predict [19].Even smaller model systems are viruses and bacteriophages. Bacteriophages are useful from an engineering and genome-design standpoint because of their comparatively small size, genetic tractability, and potential uses in medical and biotechnological applications [20-25]. Despite the limited set of genes encoded by their genomes, phages are nevertheless capable of producing complex, time-varying patterns of gene expression [26]. While some of these dynamics are governed by networks of sequence-specific promoters and transcription factors, some phages—such as the Escherichia coli phage T7—appear to regulate a portion of their transcriptional demand by tuning production and degradation rates [27, 28]. The full range of expression dynamics that can be achieved by varying only these mechanisms is currently unknown.Here, we use an in silico model of phage infection to explore the range of gene expression complexity that can be achieved using only basic regulatory mechanisms, including transcription, transcript termination, transcript cleavage, and transcript degradation. Our approach relies on computationally evolving genomes with variable-strength promoters, terminators, and RNase cleavage sites to reproduce predefined gene expression time-course patterns. We show that this molecular-level simulation framework can discover solutions to numerous non-trivial patterns and that successfully evolved genomes for particular patterns display a variety of distinct genome architectures. Taken together, these findings lay the groundwork for future efforts to use in silico evolutionary design methods to engineer novel phage genomes.
Materials and methods
Gene expression simulations
We used the gene expression simulation platform Pinetree [29], version 0.3.0, to simulate expression of mRNA transcripts encoded by a bacteriophage genome containing a small number of genes (between three and ten in our simulations). Our gene simulations included mRNA translation and protein production—with equal ribosome binding site strengths assigned to all genes—but for the purposes of this study we analyzed only mRNA transcript levels. Gene expression in Pinetree is stochastic, and it is modeled at the level of individual molecules. In particular, RNA polymerases are tracked individually as they attach to the bacteriophage genome and commence transcription one nucleoside at a time. Additionally, the polymerases can: i) collide with other polymerases transcribing the same genomic region, ii) terminate transcription upon encountering a terminator element, or iii) read through a terminator.In addition, RNase molecules can attach to RNAse cleavage sites within transcripts and cleave them—leading to the subsequent directional degradation of RNA fragments from the 5’ to the 3’ end. Degradation of transcripts from cleavage sites is 1000-fold more efficient than degradation of nascent transcripts from their 5’ end. A detailed explanation and justification for the directional transcript degradation model is provided in Ref. [28]. We additionally extended this prior work to allow for site-specific RNase cleavage strengths, and included this feature in the Pinetree 0.3.0 release. RNase cleavage strengths can be thought of as composite parameters that incorporate both the binding specificity of the RNase molecule and the enyzmatic cleavage rate. Either of these biochemical processes may exhibit sequence specificity, and here we did not model them individually.Pinetree takes an arbitrary genome as input (which we note is not a sequence file but rather an input file defining the location of individual genes, promoters, terminators, etc.), as well as parameters describing rate constants and cellular-level properties of the infected host cell, such as cell volume or number of specific molecules within the cell. We kept most values at their defaults, but changed the footprint sizes of polymerase and ribosome molecules to better reflect their realistic size (35 and 30, respectively) and additionally set the polymerase copy number to a value of 4 to shorten simulation run-times. These polymerase molecules are all previously existing host-cell polymerases that do not degrade and are being co-opted by the phage. However, we note that the Pinetree platform allows for the production of phage-specific polymerases (provided that they are encoded on the phage genome) and this flexibility may allow for the production of more complicated genetic circuits—which are beyond the scope of this work. In all cases, we simulated the initial five minutes of real time infection.The primary phage model we studied here consists of genomes containing three protein-coding genes, each 150 nucleotides long. Thus, each encoded protein was 50 amino acids long, and none of the proteins had any function within the simulation; we merely used them to study arbitrary gene expression regulation. We also considered genomes containing ten protein-coding genes, to investigate the scalability of our approach to larger systems. In all cases, the complete genome is available for all molecular binding events at the beginning of the simulation.
Evolving genomes with specific gene expression patterns
We used our stochastic model of gene expression to evolve phage genomes to display specific temporal patterns of expression. We chose to implement a stochastic, molecular-level simulation because ordinary differential equation-based models make a number of simplifying assumptions and thus are unable to accurately model important biological effects such as molecular collisions. In all cases, we first defined a target phenotype as a gene expression time-course for all genes included in the genome. We then took a starting genome with a single, moderate-strength promoter upstream of the first coding sequence in the genome, and we subjected this genome to subsequent rounds of mutation and selection. Our evolutionary simulations ran for 5000 generations each, and we implemented a simulated annealing procedure that steadily increased the strength of selection with increasing number of generations. The goal with this setup was not to simulate realistic evolutionary processes of natural populations but rather to permit efficient exploration of a potentially rugged fitness landscape and thus maximize our chance of finding suitable genome architectures.We considered three different types of mutations: addition of a regulatory element (in this case, either a promoter, terminator, or RNase cleavage site), removal of a single existing regulatory element, or modification of the binding constant for an existing regulatory element. The addition and removal functions act as insertion and deletion mutations, respectively, whereas modifications to binding strengths are analogous to point mutations that alter specific rate parameters. All protein-coding sequences are defined in the starting genome and are not altered through any mutational step.To determine the appropriate rate constants for individual regulatory elements as well as their starting values when they are inserted into a genome, we simulated simplified genomes and measured the expression response of downstream genes (S1 Fig in S1 File) using a model containing only three coding sequences. From this qualitative analysis we chose to insert promoters at a strength of 106 (with minimum and maximum values set at 105 and 1013, respectively), RNAse sites at 5 × 10−3 (with minimum and maximum values set at 0 and 1), and terminators at a strength of 0.2 (with minimum and maximum values set at 0 and 1). After inserting an element, the element strength can be modified in subsequent mutational steps. Each modification is proposed by drawing a value from a normal distribution with mean of 1 and standard deviation of 0.1, and multiplying this value by the current element strength. If the proposed mutation results in an out-of-bounds value, the value is simply discarded and the process repeated until a valid proposed mutation is drawn.Our mutational process is not meant to mimic natural evolution. Rather, some mutation is always proposed at every generation—either element addition, removal, or strength modification. This is accomplished by first enumerating all possible mutations given the current genome status (deletion and modification can happen to every existing element, and every non-existing element can be added) and then randomly selecting one of the possibilities from this set (i.e. addition of a promoter between genes 2 and 3). Only one of each element-type can exist in a particular region between two genes, so once there is a terminator between genes 2 and 3, for instance, another terminator can not be proposed in this specific location.While some mutation will always be proposed at each generation, many of these mutations will not be accepted. We calculated the fitness of a genome from the distance between its gene expression pattern and the target pattern. Because gene expression patterns are obtained from stochastic simulations, we performed ten replicate simulations for each genome and averaged mRNA abundances across replicates before proceeding with the fitness calculations and determination of acceptance probabilities. We defined fitness in terms of the normalized root mean square error (normalized RMSE). We obtained normalized RMSE by first calculating the RMSE for a single gene k,
where y(t) is the observed gene expression level of gene k at time t, Y(t) is the target gene expression level of gene k at time t, and T is the total number of time steps simulated for the gene expression time course. Subsequently, we normalized each RMSE by the mean target abundance, , and then averaged these values across all genes,
where M is the total number of genes in the genome. Additionally, and after analysis of several distinct patterns and simulations, we determined that normalized-RMSE values ≤ 0.1 aligned with our expectation about the suitability of individual solutions. We settled on this value and applied this threshold throughout the results whenever we refer to simulations that produce a final pattern that successfully matched the target.To convert RMSEnorm into fitness f, we employed the Fermi functionWe have f = 1/2 when RMSEnorm = 0 (i.e. when the genome exactly displays the target expression pattern) and f monotonously declines to zero as RMSEnorm increases. The constant β determines how quickly f declines with increasing RMSEnorm, and thus we can vary β to modify the strength of selection in the simulations.In addition to the Fermi function, we also explored two additional fitness functions, exponential decline,
and linear decline,Both of these functions produce qualitatively similar results to the Fermi function.We evolved genomes using an accelerated origin–fixation model [30]. In this model, at any point in time the evolving population is represented by a single resident genotype. Whenever a novel, mutated genotypes arises, it can either go to fixation, thus becoming the next resident genotype, or it can be lost to genetic drift. Whether a new genotype is accepted or not is determined probabilistically according to the value of Paccept, defined as
where f is the fitness of the resident genotype, f′ is the fitness of the new mutant, and Ne is the simulated effective population size. We note that beneficial mutations are unconditionally accepted. This choice allows for rapid exploration of the fitness landscape while resulting in steady-state sampling identical to the one obtained when using the traditional Kimura formula for the probability of fixation [30].To allow for even more rapid sampling of the fitness landscape, we implemented a simulated annealing procedure where selection was weak initially but became increasingly stronger towards the end of the evolution. We modified the strength of selection by varying the parameter β in Eq (3), while keeping Ne constant at Ne = 1000 throughout. We varied β according to the following schedule: For the first 500 generations, β = 10−3, to noisily explore the fitness landscape (S2 Fig in S1 File). From generation 501 to 4500, we linearly increased β from 10−3 to 1.1. Finally, from generation 4501 to 5000, we linearly increased β from 1.1 to 1.3. This simulated annealing approach ensured that evolutionary simulations explored a wide range of distinct genome architectures yet tended to settle on one genome architecture towards the end of each evolution. For each pattern that we explored, we performed 50 replicate simulations using the above approach.
Accounting for possible gene re-arrangements
In our gene expression model, gene order may affect observed gene expression levels because transcription proceeds from the 5’ end to the 3’ end but may stochastically terminate at defined terminator sequences (with varying efficiency). Therefore, gene identity matters when comparing a genome’s expression pattern to a target pattern and it is a very different requirement to state that the first gene in a genome has to be the most expressed versus the last gene.In practical applications of genome engineering, however, we would not normally be concerned about the order of genes in the genome, as long as each gene is expressed at appropriate levels over time. Thus, we explored general patterns by simulating all possible gene arrangements for each pattern. For a three-gene genome, there are six possible combinations for matching the genes in the genome to the genes described in the target expression pattern. We simulated all six variations for each pattern, and then we selected the gene arrangement that had the lowest normalized-RMSE (averaged across 50 independent simulation replicates) as the representative for that pattern. For the 10 gene simulation, we simulated only a single predetermined gene order and note that attempting to simulate all possible gene order variants would be impractical for this case.
Removing elements with negligible effect
When analyzing final evolved genome architectures, we wanted to be certain that each regulatory element (promoter, terminator, RNase cleavage site) contained within that genome had a substantial effect on the overall expression pattern. We did this solely so that we could investigate individual genome architecture solutions and compare the diversity of genome architectures that evolved for particular patterns. Since rate constants for elements were allowed to evolve, it was possible that individual elements evolved towards rate constants that were so low that the elements had essentially no function. Thus, at the end of evolutionary simulations, we evaluated each element’s effect on the final gene expression pattern and removed all elements with a negligible effect.To do so, we removed each element from the genome and then analyzed how the resulting gene expression pattern changed relative to the complete genome as it evolved. If the normalized-RMSE value remained below the threshold that we used as a quantitative indicator of a successful simulation (using a threshold of 0.1, as previously noted), we recorded the element as a candidate for removal. After evaluating all regulatory elements in this manner, we removed the element whose removal had the least impact. We then repeated this process, greedily removing elements one by one until no further elements could be removed without pushing the normalized-RMSE value above our threshold of success. We note that this greedy algorithm for removing individual elements neglects potential epistatic effects, such as when different regulatory elements have compensatory or conflicting roles. Given the small size of the genomes that we focused on here, we expect that these effects are generally minor but may nevertheless become more pronounced for larger and more complex genomes.
Assessing the diversity of evolved genomes
Our evolutionary simulations generally resulted in multiple, independently evolved genomes that had different genome architectures for each target pattern. To quantify the diversity of evolved architectures, we calculated the entropy of the set of evolved solutions that were deemed successful for the given target. In this calculation, we only considered the presence or absence of individual regulatory elements (after removing insubstantial elements, as described above) in the genome while ignoring the rate constants/element strengths. We identified the unique genome architectures among the n successful solutions (ignoring simulations that produced a normalized-RMSE >0.1) and then determined the count n for each unique architecture i (so that ∑
n = n). We then calculated entropy H asBecause the logarithm is taken to base 2, the resulting entropy value is measured in units of bits.
Data and code availability
All code and data required to reproduce this work are available at: https://github.com/SahilBShah/pinetree-evolution and also archived on Zenodo at https://doi.org/10.5281/zenodo.4592577.
Results
Evolutionary simulation of phage gene expression
Our goal is to understand the range of possible gene expression patterns that a phage could produce by varying only a small set of regulatory components. We reasoned that rational genome design would be difficult for all but the most trivial cases, and that fully enumerating all possible genomes is impractical. Instead, we focused on developing an evolutionary strategy to engineer in silico genomes to match a diverse set of target pre-defined gene expression time-courses.This computational strategy that we developed relies on a stochastic gene expression platform (Pinetree), which simulates the process of phage infection in molecular-level detail and produces time-course data of RNA abundance as an output [29]. We emphasize here that Pinetree does not operate directly on DNA sequences but rather uses parameter file inputs that define the location and strength of individual genes, promoters, terminators, etc. on a hypothetical genome sequence. We built upon this software to enable evolutionary simulations, where discrete generations consist of individual simulations of phage infection and the resulting RNA species time-course is our ultimate phenotype of interest. Pinetree expects a genome as input and under our evolutionary approach this genome varies from generation-to-generation as the location and strengths of promoters, terminators, and RNase cleavage sites are subject to mutation, selection, and drift.As a proof-of-principle, we first used Pinetree to simulate gene expression from an arbitrary example genome architecture. The positive control genome architecture and resulting gene expression pattern produced from a single simulation are shown in Fig 1A (first panel). Using this gene expression output as our target data, we next attempted to evolve a genome from scratch that was capable of producing similar levels of gene expression over time. Our evolutionary simulation starts with a three-gene genome containing only a single promoter, whose time-course of gene expression initially bears little resemblance to the target. As generations progress, single mutations are proposed and conditionally accepted based on whether they alter the resulting gene expression pattern to more closely resemble the target (see Materials and methods). More specifically, we define the fitness of a mutation as the inverse of the normalized-Root Mean Squared Error (RMSE) relative to the target. Individual mutations that result in a better match to the target expression pattern (i.e. have lower normalized-RMSE values) are accepted, with some probability of accepting slightly deleterious mutations to allow for efficient exploration of the potentially rugged fitness landscape.
Fig 1
Evolutionary simulation of a positive control gene expression pattern.
A: Genome architectures and corresponding gene expression time courses. The target pattern (first panel) is a gene expression time course generated by an arbitrarily chosen genome architecture simulated in Pinetree. This architecture has a promoter in front of genes X and Y (indicated by inverted L with an arrow pointing right), a transcriptional terminator after genes Y and Z (indicated by a T), and an RNase cleavage site in front of gene Y (indicated by a cross). Line colors correspond to the genes labeled in the genome architecture. We use these symbols and colors consistently throughout this work to display genome architectures. The remaining panels illustrate the evolutionary process: the starting genome architecture and corresponding gene expression time-course (generation 0), an example from the middle of the evolutionary simulation (generation 488), and the final evolved architecture (final pattern). The final evolved genome architecture is similar but not identical to the target. B: The normalized-RMSE metric declines with increasing numbers of generations as the genome architecture evolves to reproduce expression patterns that are increasingly similar to the target pattern.
Evolutionary simulation of a positive control gene expression pattern.
A: Genome architectures and corresponding gene expression time courses. The target pattern (first panel) is a gene expression time course generated by an arbitrarily chosen genome architecture simulated in Pinetree. This architecture has a promoter in front of genes X and Y (indicated by inverted L with an arrow pointing right), a transcriptional terminator after genes Y and Z (indicated by a T), and an RNase cleavage site in front of gene Y (indicated by a cross). Line colors correspond to the genes labeled in the genome architecture. We use these symbols and colors consistently throughout this work to display genome architectures. The remaining panels illustrate the evolutionary process: the starting genome architecture and corresponding gene expression time-course (generation 0), an example from the middle of the evolutionary simulation (generation 488), and the final evolved architecture (final pattern). The final evolved genome architecture is similar but not identical to the target. B: The normalized-RMSE metric declines with increasing numbers of generations as the genome architecture evolves to reproduce expression patterns that are increasingly similar to the target pattern.As shown in Fig 1A (right-most panel), the best gene expression time-course that evolved over a 5,000 generation evolutionary simulation qualitatively matches the target. To quantitatively make this determination, we found—through visualization of numerous simulations—that normalized-RMSE values ≤ 0.1 generally aligned with our qualitative determination about the success of any given solution. We henceforth consider an evolutionary simulation successful if the final normalized-RMSE falls below this value, which occurs in this case (Fig 1B). Thus, this positive control shows that our approach is capable of evolving genome architectures that match complex patterns from simple starting points.
Evolving genomes to match a range of gene expression patterns
After successfully conducting an evolutionary simulation with a positive control, we applied the same method to a range of gene expression patterns that have no a priori known genetic solution. We began with a simple pattern where all three genes linearly increase in abundance over time but do so at different rates (Fig 2A, left two panels). A rational solution to this pattern might be to have a strong upstream promoter with weak terminators between each successive gene. An alternative solution would be to start with a weak upstream promoter followed by subsequent promoters between each gene so that each successive gene will be expressed at a higher level. While in the first case that we outlined the first gene in the genome would be expressed at the highest rate, in this latter scenario the last gene would be expressed at the highest rate.
Fig 2
Examples of successfully evolved gene expression patterns.
A: Each panel contains a target gene expression time-course pattern. Collectively, these patterns span a diverse range of possibilities. B: The corresponding representative gene expression patterns from successful simulations for each respective target. Shown above each time-course of gene expression is the evolved genome architecture that produced it. The architectures in the first two panels contain only promoters and terminators, while the remaining three architectures also contain RNase cleavage sites. RNase cleavage sites seem to be required for any regulatory patterns that go beyond a simple linear increase in expression level over time.
Examples of successfully evolved gene expression patterns.
A: Each panel contains a target gene expression time-course pattern. Collectively, these patterns span a diverse range of possibilities. B: The corresponding representative gene expression patterns from successful simulations for each respective target. Shown above each time-course of gene expression is the evolved genome architecture that produced it. The architectures in the first two panels contain only promoters and terminators, while the remaining three architectures also contain RNase cleavage sites. RNase cleavage sites seem to be required for any regulatory patterns that go beyond a simple linear increase in expression level over time.As we expected, the left two panels of Fig 2B show that we were able to successfully evolve this linearly increasing pattern (with two different gene arrangements) and that the evolved genome architectures match our rational expectation. The final three examples presented in Fig 2 further show successful simulations (the lowest normalized-RMSE values found across 50 independent replicates) for more complex patterns that are difficult to rationalize solutions to. While this demonstrates that genomes in our system can evolve to match several distinct patterns with qualitatively successful solutions, we wanted to further increase the complexity of our targets and evaluate replicate simulations more quantitatively.We conducted a total of 300 independent evolutionary simulations for each of 10 general patterns (Fig 3A), split into 50 simulations for each of the 6 possible gene arrangements per pattern (S3 Fig in S1 File, see Materials and methods). The patterns we explored were characterized by increases in absolute transcript abundances at various rates, as well as plateaus in abundance that are indicative of a steady-state balance between transcript abundance and degradation. More complex gene expression dynamics—such as gene expression delays and oscillations—will almost certainly require more complex regulatory circuitry than we allow for here. However, the range of possible dynamics that can be achieved even with combinations of plateaus and linear increases is unknown, which is why we chose to focus our efforts on these initial patterns.
Fig 3
Quantitative assessment of pattern achievability.
A: The 10 target gene expression time-course patterns that we simulated are displayed as black lines representing general patterns (each of which has 6 possible gene arrangements). B: For 50 replicate simulations of each pattern, shown are the lowest achieved normalized-RMSE values. All patterns had at least two successful independent evolutionary simulations, as determined by the red line highlighting a normalized-RMSE value of 0.1.
Quantitative assessment of pattern achievability.
A: The 10 target gene expression time-course patterns that we simulated are displayed as black lines representing general patterns (each of which has 6 possible gene arrangements). B: For 50 replicate simulations of each pattern, shown are the lowest achieved normalized-RMSE values. All patterns had at least two successful independent evolutionary simulations, as determined by the red line highlighting a normalized-RMSE value of 0.1.For the particular gene arrangement that yielded the best results for each pattern, Fig 3B shows the distribution of normalized-RMSE values from the 50 replicate simulations (S4 Fig in S1 File shows corresponding gene expression time-courses for the best replicate). Each pattern had at least two successful replicate simulations (normalized-RMSE ≤ 0.1), but there was nevertheless a clear difference between how easily solutions to particular patterns were able to evolve—compare and contrast boxplots for patterns #5 and #6 in Fig 3B. For pattern #5, the box lies entirely below the red line, indicating that over 75% of simulations were determined to be successful according to our cutoff criterion. By contrast for pattern #6, the box lies entirely above the red line, indicating that over 75% of the simulations were not determined to be successful.These result suggests certain gene expression time-course patterns may have a more restrictive set of genome architectures capable of producing them, but we also note that the evolutionary parameters could perhaps be altered to better optimize solutions to particular patterns. As a robustness check, we performed evolutionary simulations to match the same 8 patterns depicted in Fig 3 using two additional fitness functions and observed qualitatively similar results (S5 Fig in S1 File).
Disparate genome architectures can recapitulate the same target expression pattern
Beginning with our positive control simulations (Fig 1), we saw evidence that different genome architectures could produce comparable time-course patterns of gene expression. In this case, while the evolved genome produced a nearly identical gene expression pattern to the target, the evolved genome differed from the input genome that was used to produce the target expression data. To investigate this trend more broadly and for different gene expression patterns, we interrogated the diversity of the successfully evolved genome architectures for each pattern. Overall, we found that there was no single genome architecture that completely dominated results for any of the target patterns. As illustrated in Fig 4A, even the simplest pattern that we assessed had numerous distinct (albeit qualitatively similar) genome architectures that evolved, each of which were capable of producing gene expression time-courses that successfully matched the target. However, some patterns were more heterogeneous than others in terms of the number of possible genome architectures found (Fig 4B).
Fig 4
Diversity of successfully evolved genome architectures.
A: A sample of the successfully evolved genome architectures for pattern #1 showing that multiple potential architectures can reproduce the target expression pattern. Only the architectures that evolved most frequently are shown. Multiples on the right indicate how often each architecture arose among 50 replicates. B: Similar to panel (A), a sample of the successfully evolved architectures for pattern #5. We note the much higher frequency of terminators and RNase cleavage sites in genome architectures evolved for pattern #5 versus pattern #1. C: Entropy values for the set of successfully evolved genome architectures for each target pattern. Lower values indicate that successful replicate simulations converged upon one or a small number of genome architectures. More specifically, an entropy value of 1 corresponds to their being effectively 2 (21) distinct architectures for a given pattern from amongst the replicates, while an entropy value of 5 corresponds to their being effectively 32 (25) distinct architectures.
Diversity of successfully evolved genome architectures.
A: A sample of the successfully evolved genome architectures for pattern #1 showing that multiple potential architectures can reproduce the target expression pattern. Only the architectures that evolved most frequently are shown. Multiples on the right indicate how often each architecture arose among 50 replicates. B: Similar to panel (A), a sample of the successfully evolved architectures for pattern #5. We note the much higher frequency of terminators and RNase cleavage sites in genome architectures evolved for pattern #5 versus pattern #1. C: Entropy values for the set of successfully evolved genome architectures for each target pattern. Lower values indicate that successful replicate simulations converged upon one or a small number of genome architectures. More specifically, an entropy value of 1 corresponds to their being effectively 2 (21) distinct architectures for a given pattern from amongst the replicates, while an entropy value of 5 corresponds to their being effectively 32 (25) distinct architectures.We quantified the diversity of evolved genome architectures for successful replicate simulations for each pattern using information entropy (as in Fig 3 we analyzed data from only the best of the 6 possible gene arrangements for each pattern). Lower entropy values imply that only a single or a small number of distinct genome architectures evolved for a particular pattern. By contrast, high entropy values imply that the genome architecture solutions for a particular pattern were largely distinct from one another. Fig 4C illustrates the entropy scores for each of the 10 patterns showing that pattern 5 had the most diverse set of genome solutions whereas pattern 10 (which had only two successful simulations) was the most constrained. Taken together, these results show that particular patterns may be more flexible than others in terms of the number of possible solutions but the cause of this variability is unknown at this stage and a possible avenue for future research.
Evolutionary simulation of a ten-gene model
Although the three-gene model is ideal for studying the basic principles behind genetic regulation, the size of this genome is unrealistically small compared to most phage genomes. Thus, we wanted to test whether our simulation approach was capable of evolving larger genomes containing up to 10 genes (on par with the smallest known phage genomes). As shown in Fig 5, we achieved a qualitatively suitable result when attempting to simulate a simple linear pattern with variable rates of increase for each gene within the genome. However, we note that increasing the number of genes within the genome also increases the run-time of the simulation substantially since more generations are required to adequately explore the fitness landscape given that there are many more possible mutations (i.e. regions between genes that can potentially contain regulatory elements). This result, nevertheless, shows that our approach is generalizable and can, in principle, be extended to any number of genes with complex, predefined target patterns barring computational limitations.
Fig 5
Evolutionary simulation of a ten-gene model.
A: Shown are the target (left) and evolved (right) patterns. The evolved pattern is the best of 5 independent replicates, each of which was simulated for 100,000 generations. B: The starting (top) and best fitting evolved (bottom) genome architectures. Differential gene expression in the evolved genome happens primarily via transcription termination with readthrough. The genome has only two promoters and one RNase cleavage site, but six terminators that successively reduce transcription as we move from the 5’ to the 3’ end of the genome.
Evolutionary simulation of a ten-gene model.
A: Shown are the target (left) and evolved (right) patterns. The evolved pattern is the best of 5 independent replicates, each of which was simulated for 100,000 generations. B: The starting (top) and best fitting evolved (bottom) genome architectures. Differential gene expression in the evolved genome happens primarily via transcription termination with readthrough. The genome has only two promoters and one RNase cleavage site, but six terminators that successively reduce transcription as we move from the 5’ to the 3’ end of the genome.A further feature that is evident within this simulation is that the 3’ most genes experience significant expression delays, presumably as a result of having evolved weaker promoters that are out-competed for polymerase molecules by the stronger 5’ promoters. However, even within this 10-gene system the time-scale of these delays is modest in comparison with the 5 minute simulation time. Implementing substantial gene expression delays into realistic systems that rely solely on this delay-feature would require inserting large pieces of non-coding DNA to space apart the temporal expression of genes. While this is certainly possible, we find it likely that substantial gene expression delays can be more efficiently implemented via genetic interactions that we do not consider here.
Summary of observed design patterns
From the entirety of our simulation results, we have identified the following patterns of regulatory control. In general, combinations of promoters and terminators only lead to a linear increase of transcript abundances, possibly at differing rates, as the effect of these regulatory elements does not depend on the current abundance of a transcript. For instance, for the first gene to be expressed more rapidly than the second gene, there can be a strong promoter in front of the first gene and a weak terminator in front of the second gene. Examples include the first column of Fig 2 and also Fig 5. By contrast, a weak promoter in front of the first gene and a second weak promoter in front of the second gene will create a pattern where the second gene is expressed more rapidly than the first gene (second column of Fig 2). RNase cleavage sites are required for time courses where transcript levels reach a steady state, as the amount of degradation following cleavage is proportional to transcript levels and thus increases with increasing transcript abundance (examples include the three right-most columns in Fig 2 and all genome architectures in Fig 4B).For time courses with a cross-over event, where one transcript levels off while another, initially less abundant transcript keeps steadily increasing (last column in Fig 2), we have found that typically a weak promoter precedes the first gene, a strong promoter precedes the second gene, and an RNase cleavage site also precedes the second gene but follows the second promoter. The weak promoter allows for the first gene to steadily increase at a low rate. The strong promoter rescues transcription of the downstream gene. The following RNase cleavage site will degrade the transcripts produced, in proportion to the transcripts’ abundance, and thus the transcripts of the second gene will eventually reach steady state.Finally, even though RNase cleavage sites will generally cause eventual steady state, we have also observed cases where a weak RNase cleavage site follows a strong promoter, resulting in a time course that does not meaningfully level off on the time scale of our simulation results (examples include the third, fourth, and fifth genome architecture in Fig 4A). In those cases, the RNase cleavage site simply serves to reduce the effect of the promoter. This observation highlights that the transcript level at which a steady state is reached depends on the relative strengths of the promoter and the RNase cleavage site, and for sufficiently strong promoters and weak cleavage sites a steady state may never be observed during physiological simulation times, even though it exists and would be reached if we let the simulations run for sufficiently long times.
Discussion
Genomes of all species employ a myriad of regulatory strategies to ensure that individual genes are expressed at particular levels that vary over time. Here, we have used computational modeling to demonstrate that bacteriophages can generate a range of complex and time-varying expression patterns without the need for specific regulatory molecules or complex genetic circuitry. By varying only the strength of promoters, terminators, and RNase binding sites, our in silico evolution platform was able to generate genomes capable of matching numerous and distinct gene expression time-course patterns. These findings show that networks of interacting promoters and transcription factors are not explicitly necessary to achieve certain gene expression designs, and provide a proof-of-principle for the de novo design and engineering of phage genomes using molecular-level evolutionary simulation.Our study is motivated by the observation that naturally evolved phages appear to regulate their gene expression via the mechanisms discussed here. Phage genomes are typically small and highly gene dense, which is likely to limit the overall complexity of their native gene expression programs [31-34]. Even with relatively small genomes, numerous studies have shown that phages are capable of producing complex gene expression dynamics over the course of an infection cycle. In phage T7, for instance, genes that are expressed early in the infection cycle are typically involved in shutting down synthesis of host-cell macromolecules, whereas the genes that are expressed later have roles in viral assembly and lysis [35-37]. While the T7 genome encodes its own polymerase that is critical for the delayed expression of these late genes, phage T7 gene expression dynamics are partially accomplished through the use of variable strength promoters, transcriptional terminators, and RNase cleavage sites [27, 28, 38–41]. Incorporating knowledge of variable strength regulatory elements has been critical for generating accurate computational models of phage T7 infection that are increasingly able to recapitulate empirically measured gene expression abundances [29, 38, 42, 43]. The findings presented here suggest that this level of regulatory control is likely to apply more broadly to phages in general.The arrangements of regulatory elements that arose in our simulations look similar to what is observed in bacteriophage genomes. For example, the T7 genome contains 10 RNase III cleavage sites, six of which co-occur in the same intergenic region as a promoter [44]. This arrangement of promoters and cleavage sites results in transcript ramps and cliffs [28], where transcript abundances fall off near cleavage sites, and then steadily increase due to the presence of promoters. Similarly, bacteriophage ΦX174 has a small (5386 nt) circular genome composed of 11 genes, including a cluster of co-localized capsid genes [45]. Capsid gene expression is regulated by four terminators of varying strengths [46], arranged in an alternating gene-terminator pattern similar to that of the evolved ten-gene architecture (Fig 5). In ΦX174, terminator read-through produces a stepwise pattern of decreasing transcript abundances [34], which may help ensure that structural proteins are produced with the stoichiometries necessary for proper capsid assembly. ΦX174 also contains an overlapping promoter-terminator element, situated in the intergenic region between its structural and replication genes. In our evolved genome architectures, there is at least one example of such an element (Fig 1, right two panels); in both our simulation and in ΦX174, this regulatory feature appears to decouple expression of the adjacent genes. Finally, bacteriophage T4, which has a large genome encoding over 300 genes, makes use of at least 119 promoters, as well as a number of terminators and mRNA cleavage sites, many of which co-occur as groups of two or more regulatory elements [47]. Combinations of promoters, terminators, and cleavage sites give rise to the complex, time-dependent gene expression patterns observed in T4, which are further controlled by phage-encoded regulatory proteins.Our findings here highlight some of the strengths and the limitations of regulatory approaches built solely off varying the strength of transcription, termination, and transcript degradation. The target phenotypes we used were fairly simple. We initially only considered cases in which each gene’s transcript abundance increased linearly. We then gradually increased the complexity of each gene’s expression pattern by allowing some genes to achieve steady state at certain time points. Such patterns could be evolved but we only observed a few successful simulations among 50 replicates. Our system shows that gene products can either accumulate at different rates over time via variable strength transcription or plateau at different levels by tuning degradation rates. In the most complex cases that we investigated, a combination of variable production and degradation rates was sufficient to produce expression dynamics whereby the relative ordering of genes by abundance could switch over time (Fig 3, patterns 7–10). However, the regulatory elements that we assessed do not provide a means to shut off the production of certain genes, which would enable decreases in absolute transcript abundances at particular times. Similarly, gene expression delays within this system are limited to short time scales; there is no apparent mechanism for inducing expression at a particular time without considering more complex genetic regulation—such as the early production of a phage-specific polymerase that controls later genes as in phage T7. Finally, without allowing for more complex genetic interactions, we find it unlikely that important dynamic features such as oscillations can reliably evolve given the limited set of regulatory elements that we considered here.Despite these constraints, our approach may provide important insight into the overall organization and regulatory principles governing phage genomes. For instance, RNase cleavage sites and promoter sequences frequently co-occur in phage T7 [28]. In our simulations, we observed numerous instances of this co-occurrence, indicating that this motif may provide a general means for decoupling the expression of proximal genes. However, the generality and significance of this finding is difficult to assess based on the limited set of patterns that we investigated. Future work, particularly focused on larger genomes with more complex expression dynamics, might uncover interesting design principles that have heretofore gone unnoticed.While our results provide insight into the limits and capabilities of phage regulatory strategies, our approach may also prove useful for synthetic biology applications in phages [48, 49]. Synthetic genetic circuits are typically designed for free-living species where they must operate in variable cellular contexts over comparatively long time-scales [50, 51]. By contrast, phage infections are typically brief, with relevant time-scales on the order of tens of minutes [52, 53]. We performed our simulations with realistic cellular parameters over an initial 5 minute infection period; these parameters were previously tuned to fit empirical data on phage T7 gene expression [28] and thus represent approximately accurate values in terms of molecular rates/abundances, translocation speeds, etc. Producing more complicated dynamics from much larger genomes over longer timescales could, in principle, be accomplished using minimal genetic circuitry to divide the genome into early and late gene sets; within each of these two sets our results show that it is possible to encode diverse dynamics without any further regulatory mechanisms. Indeed, the phage T7 genome is partitioned in precisely this manner, having distinct classes of early, middle, and late expressed genes that perform distinct functions [35-37].However, in any biological application, we would ultimately want to manipulate protein abundances, and we have here only considered transcriptional regulation. Protein abundances are determined both by transcript abundances and by additional parameters relevant at the translation stage, such as the rate of translation initiation or the speed of translation elongation, which may be modulated by codon usage or mRNA secondary structure [54]. These additional parameters would have to be considered in our simulation model to produce realistic predictions. We note that the Pinetree simulator we used here already has the capability to model translation, by tracking the movement of individual ribosomes along transcripts [29]. It would be straightforward to evolve systems towards target protein abundances rather than transcript abundances, and to allow mutations in translation initiation rates and/or in codon usage bias, so that each gene can evolve unique translation initiation and elongation rates.In principle, the approach that we presented here could be adapted to explicitly design phage genomes with specified gene expression dynamics but doing so presents several challenges. While Pinetree relies on genome-level information, it does not make use of sequence-level information. Rather, elements such as promoters are encoded by their location and strength—not by an explicit sequence. Translating a particular evolutionary design into an actual genome sequence would thus require using standardized parts with known element strengths; these strengths could be calibrated against the values encoded in Pinetree to aid in genome sequence design [55]. In the case of RNase sites, this may be particularly challenging because while degradation rates are known to vary considerably across genes, the overall rules concerning RNase binding and cleavage are less well understood than the actions of promoters and terminators [56]. However, this outlook is beginning to change [57-59].We envision that the results presented here can be extended in a number of different directions in future research. First, we only considered non-overlapping gene arrangements in our study but many phage genomes are highly compact and successive genes frequently overlap; this feature might impact gene expression in ways that are difficult to predict but could be assessed using our approach [23, 31]. Second, an obvious way to create more complex gene expression dynamics would be to encode a regulatory protein on the phage genome rather than the arbitrary proteins that we have thus far investigated. Extensions of our framework to consider more genes would likely benefit from such an approach, which could allow for expression delays or decreases in the abundance of particular transcripts over time. Third, we previously noted that gene order can play an important role in achieving particular targets. With only three genes we were able to enumerate all possible gene arrangements, but this approach is intractable for larger genomes and will require an additional mutational step that swaps the placement of genes for larger applications. Finally, we focused our attention on transcript abundances but differences in translation initiation and elongation rates offer further ways to tune gene expression at the level of protein products to produce more complex dynamics that operate over longer-time scales [7, 8, 60–63].An extraordinary number of phages have been discovered in recent years, but only a small number of these have been interrogated experimentally [64-68]. Computational approaches can help to better characterize the general constraints and principles that affect genome organization and function, and can additionally provide a way to engineer phages based on predetermined design constraints. The design and engineering of whole phage genomes using principles and approaches that we developed here may have a number of future medical and biotechnological applications, including combating antibiotic resistant bacteria [69].(PDF)Click here for additional data file.28 Jul 2021Submitted filename: PLOS CB Revisions.pdfClick here for additional data file.14 Jan 2022
PONE-D-21-23762
Generating dynamic gene expression patterns without the need for regulatory circuits
PLOS ONE
Dear Dr. Hockenberry,Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.
==============================
Comments from the Academic Editor:
Reviewer 1, Reviewer 2, and I all feel that the work presented in this manuscript is both scientifically sound and of interest to the PLOS One reader base. I expect to accept this work for publication in PLOS One after the completion of a minor revision that addresses all of the comments made by Reviewers 1 and 2.
==============================Please submit your revised manuscript by Feb 28 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.Please include the following items when submitting your revised manuscript:
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.We look forward to receiving your revised manuscript.Kind regards,William Ott, Ph.D.Academic EditorPLOS ONEJournal Requirements:When submitting your revision, we need you to address these additional requirements.1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found athttps://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf andhttps://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf2. Thank you for stating the following in the Acknowledgments Section of your manuscript:“This work was supported by National Institutes of Health grants R01 GM088344 to C.O.W. and F32 GM130113 to A.J.H. C.O.W. also received support from the Jane and Roland Blumberg Centennial Professorship in Molecular Evolution and the Dwight W.and Blanche Faye Reeder Centennial Fellowship in Systematic and Evolutionary Biology at The University of Texas at Austin. S.B.S received support from a TIDES research fellowship. The Texas Advanced Computing Center (TACC) at The University of Texas at Austin provided high-performance computing resources.”Please note that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:“S.B.S received support from the Texas Institute for Discovery Education in Science (TIDES) in the College of Natural Sciences at the University of Texas at Austin. C.O.W. was supported by a National Institutes of Health grant R01 GM088344, as well as support from the Jane and Roland Blumberg Centennial Professorship in Molecular Evolution and the Dwight W. and Blanche Faye Reeder Centennial Fellowship in Systematic and Evolutionary Biology at The University of Texas at Austin. A.J.H. was supported by National Institutes of Health award F32 GM130113. The Texas Advanced Computing Center (TACC) at The University of Texas at Austin provided high-performance computing resources.The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”Please include your amended statements within your cover letter; we will change the online submission form on your behalf.3. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.[Note: HTML markup is below. Please do not edit.]Reviewers' comments:Reviewer's Responses to Questions
Comments to the Author1. Is the manuscript technically sound, and do the data support the conclusions?The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: YesReviewer #2: Yes********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: YesReviewer #2: Yes********** 3. Have the authors made all data underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: YesReviewer #2: Yes********** 4. Is the manuscript presented in an intelligible fashion and written in standard English?PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: YesReviewer #2: Yes********** 5. Review Comments to the AuthorPlease use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: This work provides a computational approach for producing timed patterns of gene expression. In nature, timed gene expression has been observed in numerous organisms, and is particularly well characterized within bacteriophage. The authors thus use bacteriophage genes as a template, using several rounds of evolutionary simulations to produce the desired expression patterns. While expression of only a small number of genes are considered for most of this work, authors show that their approach can be scaled up to consider a larger number of genes. Notably, while many phage use specialized proteins to regulated timed gene expression (such as T7 RNA polymerase or C1 repressor), the authors show that timed expression can be achieved without the use of such proteins. Overall, I would recommend the publication of this work with minor changes. See below for additional comments.1. A weakness in this work stems from the fact that protein translation levels were not considered. As noted in the manuscript’s introduction, ribosome binding site strength can modulate the rate of protein translation initiation by multiple orders of magnitude. Additionally, codon usage within a gene can also modulate the rate of translational elongation. Given that these translational parameters would need to be optimized within the same sequence space as RNase cleavage sites, it would likely be necessary to also consider protein translation before the computational results described in this work could be replicated within the lab. The authors should expand their discussion to include how these parameters could be accommodated by their model.2. Particularly interesting are expression patterns shown in Figure 3, in which Gene A initially has higher expression level compared to Gene B, but later in the simulation Gene A is surpassed by expression of Gene B. This pattern seems to arise Gene A having a faster initial rate of expression, but an earlier plateau. Would it be possible to devise a configuration where Genes A, B, and C to each predominate at different time points? This would be useful for the authors to show, as it would demonstrate that the different expression patterns are more complex that simply “stronger” vs “weaker” expression.3. Additional commentary regarding how initial expression rates versus the level at which a curve plateaus are differentially controlled would of benefit to this manuscript. For instance, what parameters lead to fast initial expression, and a low level plateau? What parameters result in a slow initial expression level, and a high level plateau?Reviewer #2: In this manuscript, Shah et al. use their previously developed model, Pinetree, to engineer phage genomes in silico and predict the evolution of specific phenotypes. In doing so they show the ability to encode for dynamic gene expression without the use of multiple transcription factors, but rather by varying promoter, terminator, and RNA cleavage sites. They stochastically simulated the addition and removal of such sites while selecting for mutations that more closely match the desired pattern. The models they use and stochastic simulations allow for rapid evolution towards varying gene expression patterns of up to ten genes. In addition, they briefly discuss how these genomic architectures relate to those found in bacteriophages in nature. The work is novel to my knowledge, interesting, and well carried out. With a few modifications listed below and more discussion about the relevance to bacteriophage genomes and phenotypes, I recommend it for publication in PLOS One.Major Points1. In the results section, can you briefly relate the evolved architectures to their prevalence in bacteriophages in nature? For example, is it common to see a terminator between a promoter and its gene? You discuss this broadly in the discussion (line 393) but would be interesting to compare more specifically your evolved phenotypes.a. Are the target patterns based on natural pathways and if so, can you compare that genomic architecture to your evolved ones?Minor Edits1. In Figure 1A explain the use of the symbols in the genetic diagrams and if consistent throughout the remaining figures in the paper.a. Also explain these final genomic architectures in the figure caption(s) or main text for the evolved phenotypes (i.e. the final pattern genotype has promoters in front of each gene and terminators at X positions…)2. Do the three small grey squares at the bottom of Fig.4 panels A&B mean anything?3. In Figs S3 and S4, what is the transcription regulation of the 3 genes being varied in order? Is it the same for all orders of the genes for each pattern?a. How does Fig S3 differ from the first row of graphs presented in Fig S4?********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Corwin MillerReviewer #2: Yes: Razan N Alnahhas[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
26 Mar 2022This re-submission includes an extensive "response to reviewers" document upload, which is well formatted for clarity. Here, we will copy/paste for simplicity:Response to Reviewer CommentsWe would like to thank both reviewers for their overall positive evaluation of our work and for their constructive comments. We have made every effort to address them in the revision. For a detailed, point-by-point response, please see below. Reviewer text is typeset in blue, and our responses are typeset in black.Reviewer 1This work provides a computational approach for producing timed patterns of gene expression. In nature, timed gene expression has been observed in numerous organisms, and is particularly well characterized within bacteriophage. The authors thus use bacteriophage genes as a template, using several rounds of evolutionary simulations to produce the desired expression patterns. While expression of only a small number of genes are considered for most of this work, authors show that their approach can be scaled up to consider a larger number of genes. Notably, while many phage use specialized proteins to regulated timed gene expression (such as T7 RNA polymerase or C1 repressor), the authors show that timed expression can be achieved without the use of such proteins. Overall, I would recommend the publication of this work with minor changes. See below for additional comments.Thank you for your overall positive evaluation of our work.1. A weakness in this work stems from the fact that protein translation levels were not considered. As noted in the manuscript’s introduction, ribosome binding site strength can modulate the rate of protein translation initiation by multiple orders of magnitude. Additionally, codon usage within a gene can also modulate the rate of translational elongation. Given that these translational parameters would need to be optimized within the same sequence space as RNase cleavage sites, it would likely be necessary to also consider protein translation before the computational results described in this work could be replicated within the lab. The authors should expand their discussion to include how these parameters could be accommodated by their model.You are correct, there is an additional level of control that living organisms have access to, at the level of translation, which we did not consider here. We believe we had already covered it, in our paragraph about future research directions (2nd to last paragraph in Discussion), but to strengthen this point we have now also added the following text in the Discussion, after the paragraph starting with “While our results provide insight into…”:However, in any biological application, we would ultimately want to manipulate protein abundances, and we have here only considered transcriptional regulation. Protein abundances are determined both by transcript abundances and by additional parameters relevant at the translation stage, such as the rate of translation initiation or the speed of translation elongation, which may be modulated by codon usage or mRNA secondary structure (Novoa & Ribas do Pouplana 2012). These additional parameters would have to be considered in our simulation model to produce realistic predictions. We note that the Pinetree simulator we used here already has the capability to model translation, by tracking the movement of individual ribosomes along transcripts (Jack & Wilke 2019). It would be straightforward to evolve systems towards target protein abundances rather than transcript abundances, and to allow mutations in translation initiation rates and/or in codon usage bias, so that each gene can evolve unique translation initiation and elongation rates.New reference:E. M. Novoa, L. Ribas de Pouplana (2012). Speeding with control: codon usage, tRNAs, and ribosomes. Trends in Genetics 28:574-581. https://doi.org/10.1016/j.tig.2012.07.0062. Particularly interesting are expression patterns shown in Figure 3, in which Gene A initially has higher expression level compared to Gene B, but later in the simulation Gene A is surpassed by expression of Gene B. This pattern seems to arise Gene A having a faster initial rate of expression, but an earlier plateau. Would it be possible to devise a configuration where Genes A, B, and C to each predominate at different time points? This would be useful for the authors to show, as it would demonstrate that the different expression patterns are more complex that simply “stronger” vs “weaker” expression.We agree it would be useful to find a configuration where genes A, B, and C predominate at different time points, but we have been unable to do so. In fact, even patterns with a single crossover event (as e.g. in Fig. 2, right-most case) were not that easy to evolve. This doesn’t mean that such patterns are impossible to generate, it only implies that they are rare among the possible range of genome architectures in our model system.In the Discussion, we have revised the paragraph starting with “Our findings here highlight” to better address this point:Our findings here highlight some of the strengths and the limitations of regulatory approaches built solely off varying the strength of transcription, termination, and transcript degradation. The target phenotypes we used were fairly simple. We initially only considered cases in which each gene’s transcript abundance increased linearly. We then gradually increased the complexity of each gene’s expression pattern by allowing some genes to achieve steady state at certain time points. Such patterns could be evolved but we only observed a few successful simulations among 50 replicates. Our system shows that gene products can either accumulate at different rates over time via variable strength transcription or plateau at different levels by tuning degradation rates. In the most complex cases that we investigated, a combination of variable production and degradation rates was sufficient to produce expression dynamics whereby the relative ordering of genes by abundance could switch over time (Fig. 3, patterns 7-10).3. Additional commentary regarding how initial expression rates versus the level at which a curve plateaus are differentially controlled would of benefit to this manuscript. For instance, what parameters lead to fast initial expression, and a low level plateau? What parameters result in a slow initial expression level, and a high level plateau?We have addressed this comment by adding a new subsection in results that summarizes how different patterns are generated by certain genome architectures. This subsection is called “Summary of observed design patterns”:From the entirety of our simulation results, we have identified the following patterns of regulatory control. In general, combinations of promoters and terminators only lead to a linear increase of transcript abundances, possibly at differing rates, as the effect of these regulatory elements does not depend on the current abundance of a transcript. For instance, for the first gene to be expressed more rapidly than the second gene, there can be a strong promoter in front of the first gene and a weak terminator in front of the second gene. Examples include the first column of Fig. 2 and also Fig. 5. By contrast, a weak promoter in front of the first gene and a second weak promoter in front of the second gene will create a pattern where the second gene is expressed more rapidly than the first gene (second column of Fig. 2). RNase cleavage sites are required for time courses where transcript levels reach a steady state, as the amount of degradation following cleavage is proportional to transcript levels and thus increases with increasing transcript abundance (examples include the three right-most columns in Fig. 2 and all genome architectures in Fig. 4B).For time courses with a cross-over event, where one transcript levels off while another, initially less abundant transcript keeps steadily increasing (last column in Fig. 2), we have found that typically a weak promoter precedes the first gene, a strong promoter precedes the second gene, and an RNase cleavage site also precedes the second gene but follows the second promoter. The weak promoter allows for the first gene to steadily increase at a low rate. The strong promoter rescues transcription of the downstream gene. The following RNase cleavage site will degrade the transcripts produced, in proportion to the transcripts’ abundance, and thus the transcripts of the second gene will eventually reach steady state.Finally, even though RNase cleavage sites will generally cause eventual steady state, we have also observed cases where a weak RNase cleavage site follows a strong promoter, resulting in a time course that does not meaningfully level off on the time scale of our simulation results (examples include the third, fourth, and fifth genome architecture in Fig. 4A). In those cases, the RNase cleavage site simply serves to reduce the effect of the promoter. This observation highlights that the transcript level at which a steady state is reached depends on the relative strengths of the promoter and the RNase cleavage site, and for sufficiently strong promoters and weak cleavage sites a steady state may never be observed during physiological simulation times, even though it exists and would be reached if we let the simulations run for sufficiently long times.Reviewer 2In this manuscript, Shah et al. use their previously developed model, Pinetree, to engineer phage genomes in silico and predict the evolution of specific phenotypes. In doing so they show the ability to encode for dynamic gene expression without the use of multiple transcription factors, but rather by varying promoter, terminator, and RNA cleavage sites. They stochastically simulated the addition and removal of such sites while selecting for mutations that more closely match the desired pattern. The models they use and stochastic simulations allow for rapid evolution towards varying gene expression patterns of up to ten genes. In addition, they briefly discuss how these genomic architectures relate to those found in bacteriophages in nature. The work is novel to my knowledge, interesting, and well carried out. With a few modifications listed below and more discussion about the relevance to bacteriophage genomes and phenotypes, I recommend it for publication in PLOS One.Thank you for your overall positive evaluation of our work.Major Points1. In the results section, can you briefly relate the evolved architectures to their prevalence in bacteriophages in nature? For example, is it common to see a terminator between a promoter and its gene? You discuss this broadly in the discussion (line 393) but would be interesting to compare more specifically your evolved phenotypes.We don’t have easy access to a systematic accounting of the different architectures in nature, as it would require careful experimental study of each individual phage genome. Therefore, we now describe genome architectures of three example phages, which should serve as sufficient motivation for our study. In fact, all three phages display the types of regulatory patterns that we evolved in our simulations.We have added the following text as the third paragraph in the Discussion:The arrangement of regulatory elements that arose in our simulations look similar to what is observed in bacteriophage genomes. For example, the T7 genome contains 10 RNase III cleavage sites, six of which co-occur in the same intergenic region as a promoter (Dunn and Studier, 1983). This arrangement of promoters and cleavage sites results in transcript ramps and cliffs (Jack et al., 2019), where transcript abundances fall off near cleavage sites, and then steadily increase due to the presence of promoters. Similarly, bacteriophage phiX174 has a small (5386 nt) circular genome composed of 11 genes, including a cluster of co-localized capsid genes (Sanger et al., 1977). Capsid gene expression is regulated by four terminators of varying strengths (Hayashi et al., 1981), arranged in an alternating gene-terminator pattern similar to that of the evolved ten-gene architecture (Fig. 5). In phiX174, terminator readthrough produces a stepwise pattern of decreasing transcript abundances (Logel and Jaschke, 2020), which may help ensure that structural proteins are produced with the stoichiometries necessary for proper capsid assembly. PhiX174 also contains an overlapping promoter-terminator element, situated in the intergenic region between its structural and replication genes. In our evolved genome architectures, there is at least one example of such an element (Fig. 1A, right two panels); in both our simulation and in phiX174, this regulatory feature appears to decouple expression of the adjacent genes. Finally, bacteriophage T4, which has a large genome encoding over 300 genes, makes use of at least 119 promoters, as well as a number of terminators and mRNA cleavage sites, many of which co-occur as groups of two or more regulatory elements (Miller et al., 2003). Combinations of promoters, terminators, and cleavage sites give rise to the complex, time-dependent gene expression patterns observed in T4, which are further controlled by phage-encoded regulatory proteins.References in order of appearance:Dunn and Studier, 1983: https://doi.org/10.1016/S0022-2836(83)80282-4Jack et al., 2019: https://doi.org/10.1093/ve/vez055Sanger et al., 1977: https://doi.org/10.1038/265687a0Hayashi et al., 1981: https://doi.org/10.1128/jvi.38.1.198-207.1981Logel and Jaschke, 2020: https://doi.org/10.1016/j.virol.2020.05.008Miller et al., 2003: https://doi.org/10.1128/MMBR.67.1.86-156.2003a. Are the target patterns based on natural pathways and if so, can you compare that genomic architecture to your evolved ones?No, the target patterns were not based on natural pathways. We simply devised a range of basic patterns that should be useful in principle. See also our response to Point 2 of Reviewer 1, where we talk about the topic of target pattern selection in more detail.Minor Edits1. In Figure 1A explain the use of the symbols in the genetic diagrams and if consistent throughout the remaining figures in the paper.Symbols are used consistently throughout the manuscript. To clarify what they mean, we have added the following text to the caption of Fig 1:This architecture has a promoter in front of genes X and Y (indicated by inverted L with an arrow pointing right), a transcriptional terminator after genes Y and Z (indicated by a T), and an RNase cleavage site in front of gene Y (indicated by a cross). Line colors correspond to the genes labeled in the genome architecture. We use these symbols and colors consistently throughout this work to display genome architectures.a. Also explain these final genomic architectures in the figure caption(s) or main text for the evolved phenotypes (i.e. the final pattern genotype has promoters in front of each gene and terminators at X positions…)We have done so where feasible. It would be repetitive to enumerate every single architecture, but we agree that more extensive descriptions of genome architectures are useful. For an example of the revisions made, see our response to the previous point. Also see the new subsection we have added to the Results, in response to Point 3 of Reviewer 1.2. Do the three small grey squares at the bottom of Fig.4 panels A&B mean anything?Yes, they are an ellipsis, meant to indicate that there were additional architectures that aren’t shown, as in: A, B, C, …In hindsight, we agree that the figure was a bit confusing. We have revised the figure so that now the dots are arranged vertically. That should make it more immediately obvious that the dots are an ellipsis. We have also clarified in the figure caption that there are additional genome architectures that are not shown, by adding the following text:Only the architectures that evolved most frequently are shown. Multiples on the right indicate how often each architecture arose among 50 replicates.3. In Figs S3 and S4, what is the transcription regulation of the 3 genes being varied in order? Is it the same for all orders of the genes for each pattern?a. How does Fig S3 differ from the first row of graphs presented in Fig S4?First, we’d like to emphasize that Figs S3 and S4 are entirely different and serve different purposes. Fig S3 shows different possible target patterns, so the concept of regulatory architecture doesn’t apply there. Its purpose is to show that gene identity matters, and that we need to take this into account in our analysis.The first row of Fig S4 shows that the pattern highlighted in Fig S3 can be evolved for all gene orders. Other rows in Fig. S4 show that the same is (broadly) true for the other target patterns considered. In general, they can be matched regardless of gene order. We have revised the caption of Fig S4 to explain this point more clearly.Submitted filename: PLOS_ONE_Reviewer_Response.pdfClick here for additional data file.11 May 2022Generating dynamic gene expression patterns without the need for regulatory circuitsPONE-D-21-23762R1Dear Dr. Hockenberry,We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.Kind regards,William Ott, Ph.D.Academic EditorPLOS ONEAdditional Editor Comments (optional):The authors have elegantly addressed all of the points raised by Referees 1 and 2.Reviewers' comments:18 May 2022PONE-D-21-23762R1Generating dynamic gene expression patterns without the need for regulatory circuitsDear Dr. Hockenberry:I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.If we can help with anything else, please email us at plosone@plos.org.Thank you for submitting your work to PLOS ONE and supporting open access.Kind regards,PLOS ONE Editorial Office Staffon behalf ofDr. William OttAcademic EditorPLOS ONE
Authors: Sriram Kosuri; Daniel B Goodman; Guillaume Cambray; Vivek K Mutalik; Yuan Gao; Adam P Arkin; Drew Endy; George M Church Journal: Proc Natl Acad Sci U S A Date: 2013-08-07 Impact factor: 11.205
Authors: Simon Roux; David Páez-Espino; I-Min A Chen; Krishna Palaniappan; Anna Ratner; Ken Chu; T B K Reddy; Stephen Nayfach; Frederik Schulz; Lee Call; Russell Y Neches; Tanja Woyke; Natalia N Ivanova; Emiley A Eloe-Fadrosh; Nikos C Kyrpides Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971
Authors: Sarah M Richardson; Leslie A Mitchell; Giovanni Stracquadanio; Kun Yang; Jessica S Dymond; James E DiCarlo; Dongwon Lee; Cheng Lai Victor Huang; Srinivasan Chandrasegaran; Yizhi Cai; Jef D Boeke; Joel S Bader Journal: Science Date: 2017-03-10 Impact factor: 47.728
Authors: Ophelia S Venturelli; Mika Tei; Stefan Bauer; Leanne Jade G Chan; Christopher J Petzold; Adam P Arkin Journal: Nat Commun Date: 2017-04-26 Impact factor: 14.919