Julia Rohlhill1, Nicholas R Sandoval2, Eleftherios T Papoutsakis1. 1. Department of Chemical & Biomolecular Engineering and the Delaware Biotechnology Institute, University of Delaware, Newark, Delaware 19711, United States. 2. Department of Chemical & Biomolecular Engineering, Tulane University, New Orleans, Louisiana 70118, United States.
Abstract
Tight and tunable control of gene expression is a highly desirable goal in synthetic biology for constructing predictable gene circuits and achieving preferred phenotypes. Elucidating the sequence-function relationship of promoters is crucial for manipulating gene expression at the transcriptional level, particularly for inducible systems dependent on transcriptional regulators. Sort-seq methods employing fluorescence-activated cell sorting (FACS) and high-throughput sequencing allow for the quantitative analysis of sequence-function relationships in a robust and rapid way. Here we utilized a massively parallel sort-seq approach to analyze the formaldehyde-inducible Escherichia coli promoter (Pfrm) with single-nucleotide resolution. A library of mutated formaldehyde-inducible promoters was cloned upstream of gfp on a plasmid. The library was partitioned into bins via FACS on the basis of green fluorescent protein (GFP) expression level, and mutated promoters falling into each expression bin were identified with high-throughput sequencing. The resulting analysis identified two 19 base pair repressor binding sites, one upstream of the -35 RNA polymerase (RNAP) binding site and one overlapping with the -10 site, and assessed the relative importance of each position and base therein. Key mutations were identified for tuning expression levels and were used to engineer formaldehyde-inducible promoters with predictable activities. Engineered variants demonstrated up to 14-fold lower basal expression, 13-fold higher induced expression, and a 3.6-fold stronger response as indicated by relative dynamic range. Finally, an engineered formaldehyde-inducible promoter was employed to drive the expression of heterologous methanol assimilation genes and achieved increased biomass levels on methanol, a non-native substrate of E. coli.
Precise control of gene expression
is a necessity for designing predictable gene circuits in synthetic
biology and increasing the yields of products encoded by biosynthetic
pathways. Controlling the rate of transcription typically involves
controlling the interactions between the RNA polymerase (RNAP), the
promoter DNA, and any associated transcriptional regulators. Constitutive
expression systems, while often used for heterologous protein production,
fail to optimize expression levels of metabolic intermediates and
thus often require the cell to carry high metabolic burdens.[1,2] Synthetic gene regulatory schemes frequently employ transcription
factors such as activators and repressors to introduce various positive-
and negative-feedback control mechanisms.
Complex synthetic
regulatory networks require orthogonal transcription
factors with preferably modular DNA-binding and -sensing domains.
When controlling the expression of measurable reporters, such as fluorescent
proteins or antibiotic resistance markers, these transcription-factor-based
biosensors have shown promise by increasing the efficiency of high-throughput
screens and selections and by allowing real-time monitoring of intracellular
metabolite concentrations in dynamic pathway regulation.[3,4] Dynamic regulation, which is widespread in natural systems, eliminates
the need for expensive inducers and offers the potential for optimized
schemes that minimize unnecessary metabolic burdens through autonomous
pathway balancing.[4] Efforts to expand the
synthetic biology toolbox have focused on characterizing a range of
biosensors[5] and engineering existing regulators[6] to respond to new effectors.[7,8] Biosensor
development could greatly benefit from additional small-molecule sensors
and the elucidation of their corresponding operators.
Protein–DNA
binding interactions can be investigated and
ultimately manipulated by quantifying the sequence–function
relationship of promoter DNA. A method for elucidating sequence–function
relationships employs fluorescence-activated cell sorting (FACS) and
high-throughput sequencing and is generally called “sort-seq”
or “FACS-seq”.[9] These sort-seq
schemes begin with the generation of a library of mutated sequences
for the regulatory element or protein of interest that is large and
diverse enough to contain variants displaying a wide range of activities.[9] This library is linked to a genetic reporter
and sorted into bins on the basis of fluorescence. Here, green fluorescent
protein (GFP) expression levels represent the activity of the promoter
library cloned upstream of the gfp gene (Figure 1). Subsequent sequencing
of gated populations enables the use of various analysis methods to
quantify the activities of hundreds of thousands of variants. One
such technique, originating from information theory, allows the quantification
of the relationship between two variables, here the base at each nucleotide
position (sequence) and output expression level (function) as determined
by discrete sorted bins.[10] This quantification
is achieved by calculating the mutual information, that is, the dependence
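of the two random variables on each other:[11,12]
$$I(b_i;\mu) = \sum_{b_i}\sum_{\mu} f(b_i,\mu)\,\log_2\!\left[\frac{f(b_i,\mu)}{f(b_i)\,f(\mu)}\right] - c \qquad (1)$$
where b_i is the base
at position i, μ is the
activity bin, f(x, y) and f(x) represent
the joint and marginal frequency distributions, respectively, and c is a correction factor.[12,13] If the bases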
at position i are independent of the resulting expression
bin μ, that position is inconsequential to gene expression.
Similarly, mutations with skewed distributions, occurring more frequently
in low- or high-expression bins, identify vital nucleotide positions
that play a deterministic role in the expression level and the resulting
expression bin. While sort-seq approaches have been used to investigate
regulatory sequences and proteins,[14] they
have rarely been used in combination with mutual information techniques.
Two papers of interest used the approach to analyze mammalian enhancers[11] (termed a massively parallel reporter assay
(MPRA)) and CRP activator binding[12] to
the prokaryotic lac promoter.
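As a minimal sketch of the mutual information calculation described above, the following Python function estimates the dependence between the base at one promoter position and the sorting bin from a list of (base, bin) observations. The function name, the input layout, and the way the correction factor c enters (passed in as a plain number and subtracted) are assumptions made for illustration; this is not the analysis pipeline used in the study.

```python
import math
from collections import Counter


def mutual_information(observations, correction=0.0):
    """Plug-in mutual information (in bits) between base and bin at one position.

    observations: iterable of (base, bin_id) pairs, one per sequenced read.
    correction:   finite-sampling correction c (assumed here to be subtracted).
    """
    joint = Counter(observations)
    total = sum(joint.values())

    # Marginal counts for bases and for bins.
    base_counts = Counter()
    bin_counts = Counter()
    for (base, bin_id), n in joint.items():
        base_counts[base] += n
        bin_counts[bin_id] += n

    info = 0.0
    for (base, bin_id), n in joint.items():
        f_joint = n / total
        f_base = base_counts[base] / total
        f_bin = bin_counts[bin_id] / total
        info += f_joint * math.log2(f_joint / (f_base * f_bin))
    return info - correction


# A position whose base is unrelated to the bin carries no information.
independent = [("A", 0), ("A", 1), ("G", 0), ("G", 1)]
# A position whose base fully determines the bin carries one bit here.
deterministic = [("A", 0), ("A", 0), ("G", 1), ("G", 1)]

print(mutual_information(independent))
print(mutual_information(deterministic))
```

In this toy example the first position is inconsequential to expression, while the second behaves like a position where a mutation consistently shifts the promoter into a different bin.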
Figure 1
Sort-seq experimental
method. The promoter library was generated
using error-prone polymerase chain reaction (PCR) and transformed
into NEB5α and ΔfrmR strains. The resulting
populations spanned a large range of GFP expression levels and were
sorted into seven or eight bins using FACS. The sorted populations
were tagged and the promoters sequenced, allowing for the identification
of mutations leading to higher or lower expression levels. These mutations
could then be used to generate inducible promoters with predictable
and tunable responses.
Formaldehyde is a toxic compound but also a common cellular
metabolite
produced endogenously in all cells at low concentrations from various
demethylation reactions.[15] Escherichia coli has a native formaldehyde-inducible
promoter, Pfrm, that is found upstream
of the frmRAB formaldehyde-detoxification operon.
FrmR, the first product of the operon, is a member of the DUF156 family
of DNA-binding transcriptional regulators.[16] It binds the frmRAB promoter region and is negatively
allosterically modulated by formaldehyde.[16,17] FrmR is specific to formaldehyde, responding to acetaldehyde, methylglyoxal,
and glyoxal to far lesser degrees and not at all to a range of other
aldehydes and alcohols tested.[16,17] The genes frmA and frmB encode a formaldehyde dehydrogenase and S-formylglutathione hydrolase, respectively, and are responsible
for detoxifying formaldehyde to formic acid in a glutathione-dependent
pathway.[18] The negative-feedback regulation
of the frmRAB operon is similar to that of many other
prokaryotic operons, whereby the transcription factor represses its
own transcription.[19] Characterizing Pfrm and the Pfrm–FrmR relationship adds another operator–regulator pair
to the synthetic biology toolkit and offers further insight into protein–DNA
molecular binding mechanisms.
In addition to its ubiquitous
role in all cells, formaldehyde is
a central metabolic intermediate for methylotrophs. Synthetic methylotrophy,
or the utilization of C1 compounds such as methanol as
a carbon and energy source by non-native methylotrophs, has been pursued
in earnest recently as methanol availability increases and its price
declines.[20,21] Previous studies have shown labeling in E. coli glycolytic intermediates from 13C-methanol by heterologous expression of three enzymes from Bacillus methanolicus.[22] Methanol dehydrogenase (Mdh) converts methanol to formaldehyde,
and hexulose-6-phosphate synthase (Hps) condenses formaldehyde with d-ribulose 5-phosphate to form hexulose 6-phosphate (Hu6P),
after which phospho-3-hexuloisomerase (Phi) isomerizes Hu6P to fructose-6-phosphate
(F6P) for entrance into central metabolism. Our group recently utilized
a superior Mdh from Bacillus stearothermophilus[23] along with the B. methanolicus codon-optimized Hps and Phi enzymes to achieve E.
coli growth on methanol with a small (1 g/L) yeast
extract supplementation, demonstrating extensive 13C labeling
from 13C-methanol into glycolytic and tricarboxylic acid
intermediates and amino acids, as well as methanol conversion to the
specialty chemical naringenin.[24] We have
also demonstrated a strategy of scaffoldless enzyme assembly that
can be used to achieve superior outcomes in synthetic methylotrophy.[25] Placing formaldehyde assimilation genes under
the control of formaldehyde regulation emulates the native regulation
of the methylotroph B. methanolicus, whereby hps and phi are transcriptionally
induced by formaldehyde,[26] and results
in autonomous pathway balancing. This dynamically regulated substrate
utilization scheme is particularly beneficial considering the toxicity
of formaldehyde, the metabolic burden of constitutively expressing
the heterologous methanol assimilation genes at high levels, and the
additional burden expected from future strain engineering for the
production of valuable chemicals or secondary metabolites.
Here
we first characterize the native E. coli Pfrm response and regulation, identifying
parameters for improvement. We investigate the influence of the Pfrm architecture on the strength of repressor
binding by testing a set of Pfrm variants
and isolate approximate FrmR-binding regions. We then describe the
deconstruction and analysis of Pfrm using a sort-seq approach, obtaining, with single-nucleotide
resolution, a map of the importance of each nucleotide position for
expression, both with and without formaldehyde induction. The analysis
of the resulting rich data set allowed us to identify mutations capable
of changing promoter activity in a directed manner by manipulating
repressor and RNAP binding interactions, and this information was
used to design a set of formaldehyde-responsive promoters with tunable
basal and induced expression levels. An engineered promoter was further
used to implement and improve formaldehyde-controlled E. coli consumption of the non-native substrate methanol.
Results and Discussion
The E. coli Formaldehyde-Inducible Promoter Is an Ideal Candidate for Engineering
We aimed to
construct a formaldehyde reporter plasmid and analyze the response
of the native Pfrm. To construct the reporter
plasmid, termed Pfrm–GFP–Plac–FrmR, Pfrm was cloned upstream of gfp, and the FrmR repressor was cloned under the control of the lac promoter to limit titration issues. Pfrm activity was assayed by monitoring GFP expression via flow
cytometry. The expression of the reporter plasmid was assayed in the
NEB5α and ΔfrmR strains, representing
two different regulatory systems (Figure 2a,b). In the NEB5α strain, FrmR levels
were autoregulated by Pfrm on the E. coli chromosome in addition to being expressed
from the reporter plasmid. In the ΔfrmR strain,
FrmR was expressed only from the reporter plasmid.
Figure 2
Regulatory mechanisms
and response of the E. coli formaldehyde-inducible
promoter. Two configurations were used to
investigate regulation. A Pfrm–GFP–Plac–FrmR reporter plasmid was used
(a) with or (b) without chromosomal expression of FrmR under control
of Pfrm. Representative time–response
curves for the two configurations after induction with 0–500
μM formaldehyde are shown below their respective regulatory
schemes. (c) Response curves fit to the Hill equation for the NEB5α
strain with plasmid-expressed FrmR (blue circles) and the ΔfrmR strain with plasmid-expressed FrmR (green triangles).
Numbers denote Hill coefficients. Error bars represent standard deviations
of two replicates tested on different days.
Time–response curves for formaldehyde concentrations
from
1 to 500 μM show maximum activity from 100 to 250 min and up
to an 8-fold dynamic range, calculated as the ratio of induced activity
to uninduced activity (Figure 2a,b). As expected, FrmR expression was higher in the NEB5α
strain, as evidenced by the 3-fold lower GFP expression at time zero
compared with the ΔfrmR strain. The response
curves were modeled with the Hill equation,[27,28] which is used to characterize the induction response and cooperativity
of the system. This is of particular interest here because of the
tetrameric structure of FrmR.[17,29] The Hill equation is
given by eq 2:
$$P = P_{\min} + (P_{\max} - P_{\min})\,\frac{[I]^n}{K^n + [I]^n} \qquad (2)$$
where Pmin and Pmax represent the basal promoter activity and
the maximum promoter activity following induction, respectively; [I]
is the formaldehyde inducer concentration; n is the
Hill coefficient; and K is the apparent dissociation
constant, which is related to the inducer concentration at which the
promoter activity is half-maximal. The Hill coefficient indicates
the cooperativity of the system (i.e., the positive or negative effect
a single ligand-binding event has on subsequent events), with increasing
values >1 corresponding to more sigmoidal response curves and higher
cooperativity. The apparent Hill coefficient was 1.18 ± 0.13
when FrmR was present on the chromosome, indicating a noncooperative
promoter–repressor–RNAP interaction when FrmR is both
expressed from a plasmid and autoregulated by formaldehyde (Table S4). Without chromosomal FrmR expression,
the apparent Hill coefficient was 0.46 ± 0.02 (Table S4). The response curve for the NEB5α strain without
plasmid-expressed FrmR can be seen in Figure S2. Disruption of negative autoregulation typically increases the Hill
coefficient, leading to a tighter sigmoidal response curve.[30,31] The opposite trend is seen here, with derepression continually increasing
with higher formaldehyde concentrations, possibly as a result of the
toxicity of formaldehyde and the interruption of autoregulation due
to the high levels of plasmid-expressed FrmR.
Compared with
similarly characterized operator–regulator
biosensors with dynamic ranges up to 210-fold,[5] the 5.5-fold range of the Pfrm–FrmR
system at 100 μM formaldehyde induction has ample room for improvement.
It is also highly sensitive, responding to dosed formaldehyde concentrations
as low as 1 μM. This initial characterization suggests that
the E. coli Pfrm is a strong candidate for engineering. Further promoter characterization
through deconstruction and analysis of the Pfrm architecture can identify operator regions with high engineering
potential.
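The Hill model and the dynamic range defined above can be illustrated with a short Python sketch. It assumes the common form of the Hill equation in which K enters as K^n, so that K is the inducer concentration giving a half-maximal response. The function names and all parameter values below are hypothetical and are not fitted values from this study.

```python
def hill_response(inducer, p_min, p_max, k, n):
    """Promoter activity at a given inducer concentration (same units as k).

    Assumed form: P = p_min + (p_max - p_min) * I^n / (k^n + I^n).
    """
    if inducer <= 0:
        return p_min
    return p_min + (p_max - p_min) * inducer**n / (k**n + inducer**n)


def dynamic_range(induced, uninduced):
    """Ratio of induced to uninduced activity."""
    return induced / uninduced


# Hypothetical parameters, for illustration only.
P_MIN, P_MAX, K, N = 100.0, 800.0, 50.0, 1.18

basal = hill_response(0.0, P_MIN, P_MAX, K, N)
half_max = hill_response(K, P_MIN, P_MAX, K, N)
induced = hill_response(100.0, P_MIN, P_MAX, K, N)

print(basal, half_max, dynamic_range(induced, basal))
```

With this form, a Hill coefficient near 1 gives a hyperbolic curve, and larger values give the steeper, more sigmoidal response associated with cooperativity.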
Inverted Repeats Are Central to Pfrm Architecture and Response
The most distinctive feature of
the E. coli Pfrm promoter is a 19 base pair (bp) perfect inverted repeat, originally
hypothesized as a FrmR binding site (Figure 3).[32] Operator
sequences often contain inverted repeats, which can form hairpin loops,
an important structural feature for protein–DNA interactions.[33] A similarly regulated FrmR homologue was identified
in Salmonella enterica under the control
of a promoter lacking the large inverted repeat of the E. coli Pfrm.[29] A smaller incomplete 5′-ATAGTATA/TATAGTAT-3′ inverted
repeat was noted within the Salmonella promoter, disruption of which was shown to ablate FrmR binding.[16] Here, Pfrm–GFP
constructs were generated with different architectures to investigate
each side of the large 19 bp inverted repeat, here termed site A and
site B. Sites were replaced with scrambled and reverse-complemented
sequences to test the importance of their presence and orientation
for FrmR binding.
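The notions of a perfect versus incomplete inverted repeat, and of a reverse-complemented site, can be made concrete with a small Python sketch. The helper names are hypothetical. The first pair of half-sites is the Salmonella repeat quoted above; the second is the four-nucleotide 5′-ATAC/GTAT-3′ repeat reported for the E. coli promoter in the sort-seq analysis.

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def inverted_repeat_mismatches(left, right):
    """Count positions where `right` differs from the reverse complement of `left`.

    Zero mismatches means the two half-sites form a perfect inverted repeat.
    """
    if len(left) != len(right):
        raise ValueError("half-sites must have equal length")
    expected = reverse_complement(left)
    return sum(a != b for a, b in zip(expected, right.upper()))


# Salmonella half-sites: an incomplete inverted repeat.
print(inverted_repeat_mismatches("ATAGTATA", "TATAGTAT"))
# E. coli four-nucleotide half-sites: a perfect inverted repeat.
print(inverted_repeat_mismatches("ATAC", "GTAT"))
```

Applying `reverse_complement` to a whole site gives the kind of orientation-flipped sequence used in the reverse-complemented constructs, whereas a scrambled site keeps the base composition but destroys the repeat.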
Figure 3
Response of Pfrm binding site
variants
3 h after dosing with 0 or 100 μM formaldehyde. The 19 bp inverted
repeat is shown with two green arrows facing one another to represent
their complementary relationship. These two FrmR binding sites were
analyzed by removing or reverse-complementing the sites in tested
constructs. The promoter was deleted to yield the negative control,
and the positive construct is the native Pfrm sequence. Error bars represent standard deviations of two
replicates tested on different days.
Replacing site A and site B individually with scrambled sequences
resulted in constructs 1 and 3, respectively (Figure 3). Both constructs responded to formaldehyde,
supporting the hypothesis that each site is independently capable
of binding FrmR. The presence of only one site leads to higher expression
compared with the native two-site Pfrm, an effect that is amplified with 100 μM formaldehyde induction.
The formaldehyde concentration of 100 μM was chosen to provide
strong induction without inhibiting growth. Site B leads to greater
repression compared with site A, as evidenced by the higher expression
levels for construct 3 compared with construct 1. The enhanced effect
of site B is likely due to its position, which is closer to the transcription
start site and overlaps with the −10 site.
Because of
the overlap between FrmR binding site B and the −10
site, it was difficult to resolve expression differences due to FrmR
binding from those due to RNAP binding. In order to investigate this,
we shifted site B upstream, decreasing the space between sites A and
B from 15 to 10 nucleotides and separating site B from the −10
region in constructs 6–8. The −10 site was optimized
to the canonical “TATAAT” sequence to investigate the
effect of FrmR binding with increased basal expression due to RNAP
binding. FrmR was still capable of binding to the shifted site B,
as shown by the formaldehyde responsiveness of construct 7, which
lacks site A. Construct 6 included both fully intact sites and maintained
low basal expression while more than doubling the induced expression
with a 7.3-fold dynamic range, suggesting the ability to increase
the dynamic range of the promoter by separating the manipulation of
RNAP binding from the transcriptional repressor binding.
Ablation
of FrmR binding was achieved in constructs 4 and 5, as
shown by the lack of formaldehyde response. The expression levels
of constructs 4 and 5 should therefore represent only the effect of
RNAP binding on transcription. Construct 4 exhibits 1.6-fold higher
expression levels than construct 5, possibly because of an effect
of the scrambled or reversed site A sequence on RNAP binding. Construct
4 utilizes scrambled sequences for both site A and site B, confirming
their necessity for FrmR binding in E. coli. Construct 5 has a scrambled site B and a partial reverse complement
for site A that was unable to recover binding. The partial reverse
complement of site A cannot bind FrmR independently on the basis of
construct 5, but it appears to cause stronger repression when site
B is also present. Construct 2, with a reversed site A, shows lower
basal and induced expression compared with construct 1, which had
a scrambled site A. The analogous construct 8 with reversed site A
also showed stronger basal and induced repression compared with construct
7 with scrambled site A. This indicates a relationship between the
binding sites since the partial reverse complement of site A contributes
to repression only when site B is present and not independently.
High-Diversity Library Generation Ensures Rich Information Output
We aimed to generate a high-diversity Pfrm library containing variants covering a wide range of basal
and induced activities, as we cannot analyze the effect of mutations
that are not represented. Promoter library construction requires careful
consideration for sort-seq experiments to ensure enough diversity
to create the variants of interest for downstream analysis. The 200
bp Pfrm promoter on the Pfrm–GFP–Plac–FrmR reporter plasmid was targeted for mutation with error-prone
polymerase chain reaction (PCR) using primers flanking the region
(Table S2). The variability present in
the Pfrm libraries was assayed using flow
cytometry and compared with that in the unmutated parent Pfrm–GFP–Plac–FrmR strain. Increasing mutation frequency in the promoter
region leads to a wider fluorescence distribution and therefore a
wider range of promoter activity. However, beyond a critical threshold, a higher mutation
frequency causes an increasing percentage of the population to lose
function, usually because the mutated promoter can no longer bind RNAP
and therefore cannot initiate transcription. The goal was to use a highly
diverse library containing millions of unique sequences while retaining
function. In this case, the majority of the population should also
retain formaldehyde responsiveness, since the difference in activities
among the library clones is essential for identifying repressor binding
sites with high precision and accuracy. The final Pfrm library was chosen after three successive rounds of error-prone
PCR achieved a mutation frequency that maximized the spread of the
expressing population while minimizing the relative size of the nonexpressing
population.
The final library had an average of 6.6 mutations
per 200 bp of the promoter, with an average of 2.4 of those mutations
located within the first 80 bases. It covered a wider range of GFP
expression, as evidenced both visually (Figure S3) and by a 2.2-fold increase in the robust percent coefficient
of variation (%rCV), a metric for the spread of the fluorescence distribution
that is resistant to outlier effects (Table S5).[34] Importantly, the fluorescence distribution
of the Pfrm library showed a geometric
mean similar to that of the unmutated Pfrm, and therefore, its wider distribution was due not only to clones
with lower expression but also to clones with higher GFP expression.
Individual colonies were selected and assayed with flow cytometry
to confirm the existence of promoter variants resulting in unique
fluorescence distributions under both induced and uninduced conditions.
Constraints on the size and diversity of the promoter
library likewise limit the information output from sort-seq analysis.
Mutational bias was minimized by using a blend of polymerases (see Methods) instead of the Taq polymerase,
which has a well-documented preference for mutating A’s and
T’s.[35] Analysis of the final Pfrm library bias from sequencing results indicated
a slight preference for mutating C’s and G’s, with C
→ T/G → A being the most common mutation at a mutation
rate of 1.7% and A → C/T → G being the rarest at a rate
of 0.15% (Figure S4 and Table S6). The
per-position mutation frequency was 2.8% on average and ranged from
1% to 8% (Figure S5). Mutations at each
nucleotide position along the length of the 200 bp Pfrm are therefore represented in the final library, and while
mutation bias skews the depth of the library by variable representation
of mutations, this skew is taken into account in the calculation of
mutual information.
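The robust percent coefficient of variation (%rCV) used above to compare the spread of the library and parent fluorescence distributions can be sketched in Python as follows. The definition used here, half the spread between the 15.87th and 84.13th percentiles divided by the median, is one common flow cytometry convention and is an assumption; the cited reference may define the metric differently. The function names and the toy intensity values are hypothetical.

```python
def percentile(ordered, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    if not ordered:
        raise ValueError("need at least one value")
    pos = q / 100 * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def robust_cv_percent(values):
    """Robust %CV: 100 * half the 15.87th-84.13th percentile spread / median."""
    ordered = sorted(values)
    median = percentile(ordered, 50)
    spread = (percentile(ordered, 84.13) - percentile(ordered, 15.87)) / 2
    return 100 * spread / median


# Toy fluorescence intensities with the same median but different spread.
parent = [90, 95, 100, 105, 110]
library = [20, 60, 100, 140, 180]

print(robust_cv_percent(parent), robust_cv_percent(library))
```

Because the metric is built from percentiles rather than the mean and standard deviation, a handful of extremely bright or dim events has little effect on it.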
High-Resolution Binding Sites and Identification of Mutations of Interest from Sort-Seq Data
We aim to identify nucleotide
position targets for engineering and the FrmR binding region with
high precision and accuracy through analysis of the sequencing data
for each expression gate. Sequencing data were processed to calculate
information footprints for each of three experiments with different
levels of expressed and bound FrmR (Figures 4 and S6). The
first two experiments involved the NEB5α strain under induced
(100 μM formaldehyde) and uninduced conditions with FrmR expressed
from both the plasmid and the chromosome (Figure 4), and the third experiment involved FrmR
expression from the plasmid only in the ΔfrmR strain, presumably resulting in lower FrmR expression levels (Figure S6). Information footprints visualize
the contribution of each nucleotide in the promoter sequence to GFP
expression by analyzing the relationship between the mutations at
a specific nucleotide position and the bins into which the mutated
promoters were sorted.[12] Deleterious mutations
reducing transcription would be consistently sorted into low-expression
bins, identifying the corresponding nucleotide position as one with
high “information content”, or high potential to affect
gene expression. Similarly, high-information-content positions are
identified from mutations at positions causing higher GFP expression
and consistently falling into high-expression bins. High-information-content
nucleotide positions within the promoter region are therefore ideal
engineering targets for influencing output gene expression.
Figure 4
Information
footprints of the E. coli Pfrm with different levels of FrmR binding
in the NEB5α strain. The distributions of GFP expression from
10 000 cells are shown in the (a) uninduced and (c) induced
Pfrm libraries. Information footprints
for the (b) uninduced and (d) induced with 100 μM formaldehyde
experiments illustrate significant nucleotide positions with and without
FrmR bound. Yellow highlighting shows a large inverted repeat, while
the two left and two right orange sites indicate smaller inverted
repeats relevant to FrmR binding. Error bars indicate uncertainties
inferred from subsampling.
Heat maps displaying the distribution of mutations across
different
sequencing bins reveal the effects of mutating any individual nucleotide
with exceptional precision (Figure 5). While information footprints communicate the correlation
between nucleotide position and gene expression, heat maps include
more detailed information about the specific bases at each nucleotide
position. Information footprints do not differentiate between positive
or deleterious mutations, but heat maps visually display sequencing
information by showing the distribution of A, T, C, and G at each
nucleotide position for each expression gate. High-expression (up)
mutations of interest were identified by analyzing mutations with
a strong pattern of enrichment in high-expression sorting bins compared
with the unsorted Pfrm library. Similarly,
low-expression (down) mutations of interest had extremely low occurrence
in high-expression bins and were highly enriched in low-expression
bins. These trends confirm that the mutations of interest influenced
the level of GFP expression and resulting sorting bin.
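The enrichment analysis behind these heat maps can be sketched as a log2-fold change of per-position base frequencies in a sorted bin relative to the unsorted library. The Python sketch below is illustrative only: the function names, the use of a pseudocount to avoid division by zero, and the toy two-nucleotide sequences are all assumptions rather than details taken from the study.

```python
import math

BASES = "ACGT"


def base_frequencies(sequences, pseudocount=1.0):
    """Per-position base frequencies for equal-length sequences (A, C, G, T only)."""
    length = len(sequences[0])
    freqs = []
    for i in range(length):
        # Start every base at the pseudocount so no frequency is exactly zero.
        counts = {b: pseudocount for b in BASES}
        for seq in sequences:
            counts[seq[i]] += 1
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in BASES})
    return freqs


def log2_enrichment(bin_sequences, library_sequences, pseudocount=1.0):
    """log2(frequency in bin / frequency in unsorted library), per position and base."""
    bin_f = base_frequencies(bin_sequences, pseudocount)
    lib_f = base_frequencies(library_sequences, pseudocount)
    return [
        {b: math.log2(bf[b] / lf[b]) for b in BASES}
        for bf, lf in zip(bin_f, lib_f)
    ]


# Toy data: G at the first position is over-represented in the high bin.
unsorted_library = ["AA", "AG", "GA", "GG"]
high_bin = ["GA", "GG", "GG"]

enrichment = log2_enrichment(high_bin, unsorted_library)
print(enrichment[0]["G"], enrichment[0]["A"])
```

An up mutation would show positive values in high-expression bins and negative values in low-expression bins; a down mutation would show the opposite pattern.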
Figure 5
Identification of up
and down mutations through sequence analysis.
(a) Heat maps of the Pfrm sequence in
each of eight sorting bins in the NEB5α strain under uninduced
conditions. Mutated promoters isolated from low-GFP-expression bins
are represented in the top heat maps, while those from high-GFP-expression
bins are represented in the lower heat maps. Enrichment was calculated
as the log2-fold change of each mutation relative to the
unsorted promoter library, where red mutations are highly enriched
and blue mutations are rare in each given bin. The native sequence
is shown in black below the heat maps, and the identified up/down
mutations are shown at their specified locations in green and red,
respectively. (b) Enrichment profiles for three identified down mutations
GCA → AGT from directly upstream of the −35 site are
shown for each sorting bin. Down mutations are highly enriched (red)
in lower-expression bins and rare (blue) in higher-expression bins.
(c) Enrichment profiles for two identified up mutations AA →
GG located far upstream. Up mutations are highly enriched (red) in
high-expression bins and rare (blue) in low-expression bins.

Comparing the information footprints
for induced and uninduced
experiments yields a single-nucleotide-resolution binding site for
the transcriptional repressor FrmR. Within the larger 19 bp inverted
repeat, four-nucleotide inverted repeats can be identified (Figure ). These 5′-ATAC/GTAT-3′
inverted repeats upstream of the −35 site and overlapping with
the −10 site have lower information content in the induced
state when less FrmR is bound to the region (Figure ). Within site A, three positions exist with
much higher information content in the uninduced state compared with
the induced state. Two are within the four-nucleotide inverted repeats
5′-ATAC/GTAT-3′, and one is
directly centered in the nine-nucleotide spacer between site A and
site B. Site B shows a similar pattern complicated by the −10
site. The two 5′-ATAC/GTAT-3′
nucleotide positions show higher information content in the uninduced
state versus the induced state. The 3′ side of the inverted
repeat is entirely within the −10 site. The mutational distribution
across expression bins also identifies secondary versions of the FrmR
binding site, noting a three-nucleotide 5′-ATA/TAT-3′
inverted repeat and an imperfect 10-nucleotide 5′-ATATAGAATA/TATAGTATAT-3′
inverted repeat directly flanking the six-nucleotide G and C tracts
in site A and site B, respectively (Figure S7). Sequences with mutations extending each of these four inverted-repeat
structures were sorted into low-expression bins, indicating that the
DNA hairpin loops adopted multiple possible conformations (Figure S7).

The RNAP binding site, consisting
of two six-nucleotide regions
centered at the −35 and −10 positions, is easily identifiable
in the information footprints. The Pfrm promoter uses the canonical “TTGACA” −35 site
and the noncanonical “TAGTAT” −10 site, with the
optimal 17-nucleotide spacer. An alternate “TATAGT”
−10 site two nucleotides upstream was previously hypothesized[32] but is not supported by the low information
content at those two positions. In agreement with the information
footprints, the heat maps (Figure ) show that mutations in the −35 and −10
regions occur frequently in the lowest-GFP-expression sorting bin.
This effect is particularly true for the −35 region, which
is the consensus sequence, but less so for the −10 region,
where three “up” mutations can be identified. These
detailed information footprints and binding site information
enabled the specific engineering of Pfrm promoters with predictable activities.
Informed Design of Tunable Formaldehyde-Inducible Promoters
Mutations identified during
sequencing analysis were used in combination
to generate variants capable of a wider range of basal and induced
promoter activities. Twelve Pfrm variants
were cloned using inverse PCR (see Methods and Table S3). Repression mutations were
used for constructs 14 and 15, up mutations were used for construct
20, and combinations were used for other constructs. Site B down mutations,
A → T at position −25 and C → T at position −20,
extend the four-nucleotide inverted repeat to six nucleotides in constructs
14–17 and cause extremely low expression (Figure ). The induced expression from
only one of the four constructs is higher than the uninduced expression
from the native Pfrm. The same two down
mutations are present in constructs 22–25, but their effects
are negated by three up mutations in the −10 region, which
essentially scramble the inverted repeat within site B and cause much
higher basal and induced expression levels.
Figure 6
Response of specifically constructed Pfrm variants 3 h after dosing
with 0 or 100 μM formaldehyde.
Variants were constructed with mutations for higher (green) or lower
(red) expression levels. The promoter was deleted to yield the negative
control, and the positive construct was the native Pfrm sequence. Error bars represent standard deviations of two
replicates tested on different days.

Mutations that were highly enriched in high-expression bins were
used in combination to create high-expression formaldehyde-responsive
promoters. Construct 20 (Figure ) features six up mutations, including three within
the −10 region, and had 27-fold higher uninduced GFP expression
compared with the native promoter. Construct 20 also retains formaldehyde
responsiveness, with 2-fold higher GFP expression in response to 100
μM formaldehyde than the native Pfrm. Construct 24 similarly displays high expression levels with only
two essentially nonfunctional down mutations. Increasing the site
A inverted repeat from four to five nucleotides has a significant
effect on repression, as seen from the 6-fold lower basal and 3-fold
lower induced expression in construct 25 compared with construct 24.
The designed constructs exhibited the expression levels expected on
the basis of the lengths of site A and site B inverted repeats for
down mutations or the disruption of sites for up mutations.

Quantitative sequence activity models seek to predict the behavior
of variants assuming that mutations make additive contributions to
activity. These models fail to account for secondary structures in
the DNA and sequence features that are particularly important for
transcription factor binding. Individual down mutations may cause
lower GFP expression, but in combination they can cancel each other out.
For example, a G → A mutation at position −39 increases
the length of the site A inverted repeat from four to five nucleotides,
as in constructs 14, 17, 18, 19, 20, and 25 (Figure ), while a T → C mutation at position
−57 would similarly lengthen the site A inverted repeat. However,
when the two mutations occur together, the shorter four-nucleotide
site A is maintained as in constructs 15, 19, and 23.
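The non-additive behavior described above can be illustrated by counting how many base pairs of an inverted repeat remain complementary when walking outward from the spacer. The sketch below uses a toy sequence, not the actual Pfrm sequence: the 9-nucleotide spacer and the flanking bases are hypothetical and were chosen only to mimic a T → C change on the upstream arm and a G → A change on the downstream arm.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def inverted_repeat_arm_length(seq, spacer_start, spacer_end):
    """Count complementary positions walking outward from seq[spacer_start:spacer_end]."""
    left = spacer_start - 1   # last base of the upstream arm
    right = spacer_end        # first base of the downstream arm
    length = 0
    while left >= 0 and right < len(seq) and COMPLEMENT[seq[left]] == seq[right]:
        length += 1
        left -= 1
        right += 1
    return length

SPACER = "GGGCCCGGG"  # hypothetical 9-nt spacer
variants = {
    "native":      "TATAC" + SPACER + "GTATG",  # 5'-ATAC/GTAT-3' arms, unpaired flanks
    "right G->A":  "TATAC" + SPACER + "GTATA",  # flanks now pair (T with A)
    "left T->C":   "CATAC" + SPACER + "GTATG",  # flanks now pair (C with G)
    "both":        "CATAC" + SPACER + "GTATA",  # C and A do not pair
}
arm_lengths = {
    name: inverted_repeat_arm_length(seq, 5, 5 + len(SPACER))
    for name, seq in variants.items()
}
```

In this toy case either single substitution extends the repeat from four to five nucleotides, whereas the double substitution leaves it at four, which is the kind of interaction an additive model cannot capture.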
Application of the Engineered Formaldehyde-Responsive Promoter Enables Higher Methanol Growth
The E. coli Pfrm is uniquely qualified to achieve
dynamically regulated E. coli growth
on methanol. Methanol is converted to the toxic intermediate formaldehyde
by methanol dehydrogenase (Mdh) in the first step of assimilation,
and therefore, proper pathway balancing is vital to prevent the accumulation
of formaldehyde in the cell and associated growth inhibition. Here,
with knowledge gained from our previous studies,[24] we pursue a strategy for autonomously sustainable synthetic E. coli methylotrophy using formaldehyde-inducible
promoters.

Pfrm and the high-expression
Pfrm construct 20 (P20) were placed upstream of the methanol assimilation Mdh–Hps–Phi
operon in a ΔfrmA and Δpgi strain. The frmA gene, encoding formaldehyde dehydrogenase,
was deleted to minimize the loss of formaldehyde to carbon dioxide.
Formaldehyde dissimilation in the ΔfrmA strain
still occurs and has been attributed to promiscuous aldehyde dehydrogenase
activity, but it occurs at a much lower rate.[24] Phosphoglucose isomerase (pgi) was similarly deleted
to force the F6P from methanol assimilation down glycolysis, minimizing
the loss of carbon to carbon dioxide during the conversion of F6P
to glucose 6-phosphate and eventually ribulose 5-phosphate through
the oxidative pentose phosphate pathway. The Pfrm strain successfully consumed methanol and grew to a higher
cell density when the medium was supplemented with 60 or 240 mM methanol
(Figure a,b,d,e),
demonstrating for the first time formaldehyde-induced synthetic methylotrophy.
The yields on methanol, calculated as reported elsewhere[24] by assuming that methanol consumption accounted
for additional biomass in cultures supplemented with methanol, were
similar for the Pfrm and P20 strains (Figure c,f), but the P20 strain
achieved significantly higher biomass titers than the Pfrm strain with 60 or 240 mM methanol. We hypothesize
that the higher formaldehyde-induced expression of key methanol assimilation
genes in the P20 strain enables the
more efficient management and assimilation of toxic intracellular
formaldehyde.
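The yield calculation described above (attributing the additional biomass in methanol-supplemented cultures to the methanol consumed) amounts to simple arithmetic, sketched here. The OD-to-cell-dry-weight conversion factor and all numerical inputs are hypothetical placeholders, not values from this study; only the molar mass of methanol is a fixed constant.

```python
METHANOL_MOLAR_MASS = 32.04  # g/mol

def yield_on_methanol(od_with, od_without, methanol_consumed_mM, gcdw_per_l_per_od):
    """Return gCDW produced per g methanol consumed.

    gcdw_per_l_per_od is an assumed OD-to-dry-weight conversion factor (gCDW/L per OD unit).
    """
    if methanol_consumed_mM <= 0:
        raise ValueError("methanol consumption must be positive")
    extra_biomass = (od_with - od_without) * gcdw_per_l_per_od           # gCDW/L
    methanol_mass = methanol_consumed_mM / 1000.0 * METHANOL_MOLAR_MASS  # g/L
    return extra_biomass / methanol_mass

# Hypothetical example values, for illustration only.
example_yield = yield_on_methanol(
    od_with=1.0, od_without=0.8, methanol_consumed_mM=10.0, gcdw_per_l_per_od=0.35
)
```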
Figure 7
Growth and yield on methanol for Pfrm–Mdh–Hps–Phi and P20–Mdh–Hps–Phi plasmids in the ΔfrmA Δpgi E. coli strain. Strains were grown with M9 minimal medium supplemented with
1 g/L yeast extract with or without (a–c) 60 mM or (d–f)
240 mM methanol. (a, d) Growth curves normalized to a starting optical
density (OD) of 0.2. Numbers denote millimolar methanol consumed for
dosed Pfrm and P20 strains. (b, e) OD at 24 and 48 h for the Pfrm and P20 strains
with and without methanol dosing. (c, f) Yields on methanol for the
Pfrm and P20 strains in gram cell dry weight (CDW) per gram of methanol at
24 h. Error bars represent standard deviations (n = 4).
Conclusions
We have demonstrated the systematic and quantitative characterization,
dissection, and analysis of the E. coli formaldehyde-inducible promoter with single-nucleotide resolution.
We characterized the native Pfrm regulation
and response and determined the general FrmR operator region using
designed promoter variants. The sort-seq approach and analysis succeeded
in not only confirming the FrmR binding site but also quantifying
the effect of each nucleotide on both expression and formaldehyde
inducibility. Utilizing strategically placed up and down mutations,
we were able to engineer promoters with a range of basal and induced
expression levels in a predictable manner. Application of an engineered
formaldehyde-responsive promoter with higher basal and induced expression
levels upstream of methanol assimilation genes achieved higher biomass
titers than the native E. coli Pfrm, demonstrating not only formaldehyde-controlled
synthetic methylotrophy but also its improvement through a sort-seq-guided
engineering approach.

The formaldehyde-inducible E. coli promoter is one of dozens of uncharacterized
promoters regulated
by simple inducible transcriptional regulators. The sort-seq method,
analysis, engineering, and application described here can be applied
to any transcriptional regulator–operator sequence to be used
in synthetic biology or metabolic engineering applications, particularly
for the characterization of additional biosensors for gene circuits
and dynamic pathway regulation.
Methods
Strains, Plasmids, and Growth Media
E. coli strain
NEB5α (New England Biolabs (NEB),
Ipswich, MA) was used for plasmid cloning and maintenance. The ΔfrmR (JW0348-1) and ΔfrmA (JW0347-1)
knockout strains were ordered from the Keio deletion collection.[36] The double-deletion ΔfrmA Δpgi strain was constructed by the method
of Datsenko and Wanner[37] on the existing
Keio ΔfrmA strain cured of its kanamycin resistance
cassette. The genes on the P–Mdh–Hps–Phi
plasmid were placed in an operon configuration under the trc promoter,[38] and the RBS calculator v2.0[39,40] was used to design synthetic ribosome binding sites for each gene.
All of the strains and plasmids used can be found in Table S1. PCR primers were synthesized by Integrated DNA Technologies
(IDT) (Coralville, IA) and can be found in Table S2.

For flow cytometry analysis, cells were grown in
MOPS minimal medium[41] supplemented with
0.4% (w/v) xylose, carbenicillin (100 μg/mL), and isopropyl
β-D-1-thiogalactopyranoside (IPTG) (0.1 mM). Minimal
medium was chosen because of the more defined response curves achieved
compared with rich medium. Cultures were inoculated from overnight cultures
to an optical density (OD) of 0.05 in 5 mL in 15 mL disposable culture
tubes, incubated at 37 °C for 2–3 h, and dosed with 0–500
μM formaldehyde (MP Biomedicals, Santa Ana, CA). Inverse PCR
was used to create Pfrm variant constructs
1–8 and 14–25 by amplifying the plasmid backbones and
incorporating promoter mutations on the amplification primers (Table S3). Variant plasmids were recircularized
with the Q5 Site-Directed Mutagenesis Kit (NEB) and directly transformed
into high-efficiency chemically competent NEB5α cells (NEB).
Variant promoter sequences were confirmed with Sanger sequencing (UD
Sequencing and Genotyping Center).

For methanol growth assays,
cells were grown in M9 minimal medium[42] supplemented with 1 g/L yeast extract, carbenicillin
(100 μg/mL), and 0, 60, or 240 mM methanol (Sigma-Aldrich, St.
Louis, MO). Overnight cultures were pelleted, resuspended in M9, and
used to inoculate 30 mL of fresh medium in 250 mL baffled flasks with
rubber stoppers to an OD of approximately 0.2. Growth curves were
normalized to a starting OD of 0.2. All of the E. coli cultures were grown at 37 °C with 250 rpm shaking.
Promoter Library Generation
Error-prone PCR targeting
the 200 bp Pfrm (Figure S1) was performed using the GeneMorph II Random Mutagenesis
Kit (Agilent, Santa Clara, CA). The resulting promoter library was
purified with a PCR purification kit (QIAGEN, Germantown, MD) and
twice used as a template to obtain higher mutation rates. The P–GFP–P–FrmR plasmid was amplified omitting the native P, treated with DpnI (NEB), and extracted
from agarose with a gel extraction kit (QIAGEN). The purified promoter
library with unmutated overhang sequences was inserted back into the
plasmid backbone using the NEBuilder HiFi DNA Assembly Master Mix
(NEB). Two 20 μL reactions were transformed into a total of
20 aliquots containing 50 μL of chemically competent NEB5α
cells. Following a 1 h recovery in 250 μL of SOC medium (NEB),
all of the transformations were combined into two screw-top 125 mL
flasks with 30 mL of LB medium supplemented with 1% (w/v) xylose and
100 μg/mL ampicillin. Cultures were incubated at 37 °C
for 8 h, sampled every hour for flow cytometry analysis, and stored
frozen at −80 °C in 15% (v/v) glycerol. Plating on solid
LB medium supplemented with 100 μg/mL carbenicillin after the
initial 1 h recovery indicated a library size of approximately 11
million. For the creation of the ΔfrmR promoter
library, 4 mL of NEB5α library frozen stocks were thawed and
miniprepped (QIAGEN), and 1 μg was transformed into ΔfrmR electrocompetent cells. Following a 1 h recovery in
3 mL of SOC medium, the cells were transferred to 15 mL of LB medium
supplemented with 100 μg/mL ampicillin for 3 h and stored at
−80 °C.
Flow Cytometry and Sorting
Cells
were analyzed and
sorted with a BD FACSAria IIu flow cytometer (Becton Dickinson (BD),
Franklin Lakes, NJ). A blue solid-state laser (488 nm excitation)
and a 530/30 nm filter were used to measure eGFP. FCS files were analyzed
using Flowing Software v2.5.1 (Cell Imaging Core, Turku Centre for
Biotechnology, Turku, Finland). For flow cytometry sampling, the geometric
mean of the FITC-A fluorescence for 10 000 events was taken
as the “promoter activity”. Prior to sorting, the cytometer
was calibrated using Accudrop beads (BD) and SPHERO Rainbow calibration
particles (Spherotech, Lake Forest, IL).

On the day of sorting,
2–4 mL of library frozen stocks were thawed, centrifuged to
remove excess glycerol, and used to inoculate 25 mL of MOPS minimal
medium supplemented with 0.4% (w/v) xylose, 100 μg/mL ampicillin,
and 0.1 mM IPTG. Cells were monitored hourly until they reached a
postrecovery state (∼5 h), as indicated by 85–90% of
cells expressing GFP. Cells were then dosed with 0 or 100 μM
formaldehyde and sorted 2 h later (Figure S8). The final promoter library in the NEB5α and ΔfrmR strains was sorted into eight gates with approximately
equivalent populations, and 1 000 000 events were collected
from each gate directly into LB medium. Populations were recovered
at 37 °C overnight and miniprepped for sequencing. Plating indicated
that approximately 70% of the sorted cells survived.
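Two quantities from this section lend themselves to a short numerical sketch: the geometric-mean "promoter activity" and gate boundaries that split a population into eight approximately equal fractions. The code below is an illustration on simulated fluorescence values, not a description of how the gates were actually drawn on the cytometer; the function names and the log-normal toy data are assumptions.

```python
import numpy as np

def promoter_activity(fitc_a):
    """Geometric mean of fluorescence values (non-positive events are dropped)."""
    values = np.asarray(fitc_a, dtype=float)
    values = values[values > 0]  # the geometric mean is undefined for values <= 0
    return float(np.exp(np.mean(np.log(values))))

def equal_population_gates(fitc_a, n_gates=8):
    """Return the n_gates - 1 boundaries that split events into equal-sized gates."""
    quantiles = np.linspace(0, 1, n_gates + 1)[1:-1]
    return np.quantile(np.asarray(fitc_a, dtype=float), quantiles)

def assign_gate(fitc_a, boundaries):
    """Map each event to a gate index from 0 (lowest) to len(boundaries) (highest)."""
    return np.searchsorted(boundaries, fitc_a, side="right")

# Simulated fluorescence for 10,000 events (hypothetical log-normal distribution).
rng = np.random.default_rng(0)
events = rng.lognormal(mean=5.0, sigma=1.0, size=10_000)

activity = promoter_activity(events)
boundaries = equal_population_gates(events, n_gates=8)
gate_counts = np.bincount(assign_gate(events, boundaries), minlength=8)
```

Quantile-based boundaries guarantee near-equal event counts per gate regardless of the shape of the fluorescence distribution, which keeps the number of sequences recovered from each bin comparable.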
Next-Generation Sequencing and Analysis
Multiplexed
sequencing libraries were constructed per the manufacturer’s
instructions with a Nextera DNA Library Preparation Kit (Illumina,
San Diego, CA). Pooled libraries were sequenced on a MiSeq desktop
sequencer (Illumina) using paired-end sequencing with a read length
of 2 × 201 bases at the University of Delaware DNA Sequencing
and Genotyping Center. Reads within each experiment and sorted population
were processed to remove those under 201 nucleotides and redundant
sequences. The number of mismatches between each read and the native
sequence (the Hamming distance) was calculated, and reads with more
than approximately 17 mismatches were discarded. Over 3 million reads
met all of the quality standards and were used for further analysis.
Within each bin in an experiment, we calculated the frequency of each
base at each position from the aligned reads and divided it by the
total number of reads in the experiment to obtain the joint distribution f(b, μ)
in eq . The marginal
distributions, f(b) and f(μ), were calculated by summing
the joint distribution along the appropriate dimension. The correction
factor c(13) in eq was calculated as described
previously[12] with the number of possible
bases n equal to 4 and
the number of bins nμ equal to 7
for the ΔfrmR experiment and 8 for the induced
and uninduced NEB5α experiments. Sequence data are available
at https://www.ncbi.nlm.nih.gov/bioproject/383844.
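The read filtering and the joint and marginal distributions described above can be sketched as follows. This is a minimal illustration that assumes the information footprint is the per-position mutual information between base identity and expression bin, as is standard in sort-seq analyses; the small-sample correction factor mentioned in the text is deliberately omitted, and the function names and toy reads are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def hamming_distance(read, reference):
    """Number of mismatches between two equal-length sequences."""
    if len(read) != len(reference):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(read, reference))

def filter_reads(reads, reference, max_mismatches=17):
    """Keep full-length reads with at most max_mismatches relative to the reference."""
    return [
        r for r in reads
        if len(r) == len(reference) and hamming_distance(r, reference) <= max_mismatches
    ]

def information_footprint(reads_by_bin):
    """Uncorrected mutual information (bits) between base and bin at each position.

    reads_by_bin: list with one list of equal-length reads (A/C/G/T only) per bin.
    """
    n_bins = len(reads_by_bin)
    length = len(reads_by_bin[0][0])
    counts = np.zeros((length, len(BASES), n_bins))
    for mu, reads in enumerate(reads_by_bin):
        for read in reads:
            for pos, base in enumerate(read):
                counts[pos, BASES.index(base), mu] += 1
    joint = counts / counts[0].sum()          # f(b, mu), normalized by total reads
    f_b = joint.sum(axis=2, keepdims=True)    # marginal over bins
    f_mu = joint.sum(axis=1, keepdims=True)   # marginal over bases
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(joint > 0, joint * np.log2(joint / (f_b * f_mu)), 0.0)
    return terms.sum(axis=(1, 2))

# Toy example: position 0 fully determines the bin, position 1 carries no information.
kept = filter_reads(["ACGT", "AGGT", "TTTT", "ACG"], "ACGT", max_mismatches=1)
footprint = information_footprint([["AC", "AC"], ["TC", "TC"]])
```

With two equally populated bins, a position that perfectly predicts the bin carries 1 bit and an invariant position carries 0 bits, which matches the intuition behind the peaks in the information footprints.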