We present here a newly developed workflow─which we have called PASIV─designed to provide a solution to a practical problem with design of experiments (DoE) methodology: i.e., what can be done if the scoping phase of the DoE cycle is severely hampered by burden and toxicity issues (caused by either the metabolite or an intermediary), making it unreliable or impossible to proceed to the screening phase? PASIV─standing for pooled approach, screening, identification, and visualization─was designed so the (viable) region of interest can be made to appear through an interplay between biology and software. This was achieved by combining multiplex construction in a pooled approach (one-pot reaction) with a viability assay and with a range of bioinformatics tools (including a novel construct matching tool). PASIV was tested on the exemplar of the lycopene pathway─under stressful constitutive expression─yielding a region of interest with comparatively stronger producers.
We present here a newly developed workflow─which we have called PASIV─designed to provide a solution to a practical problem with design of experiments (DoE) methodology: i.e., what can be done if the scoping phase of the DoE cycle is severely hampered by burden and toxicity issues (caused by either the metabolite or an intermediary), making it unreliable or impossible to proceed to the screening phase? PASIV─standing for pooled approach, screening, identification, and visualization─was designed so the (viable) region of interest can be made to appear through an interplay between biology and software. This was achieved by combining multiplex construction in a pooled approach (one-pot reaction) with a viability assay and with a range of bioinformatics tools (including a novel construct matching tool). PASIV was tested on the exemplar of the lycopene pathway─under stressful constitutive expression─yielding a region of interest with comparatively stronger producers.
Entities:
Keywords:
construct matching; design of experiments; pooled approach; scoping
To
optimize the output of a metabolic pathway, there are many design
parameters that one can vary when engineering the pathway, such as
the variation of its coding sequences, the engineering of expression
levels by gene dosage, or transcriptional and translational engineering.[1] This results in a very large combination of design
parameters. Combinatorial design can be achieved through different
means, either brute force where all of the possible combinations are
made, which is limited by the number of combinations and permutations
of parts from the metabolic pathway, also known as the problem of
combinatorial explosion or by applying combinatorial strategies.[2] When optimizing a metabolic pathway with a combinatorial
approach, one of the main issues is scale and how to deal with large
design spaces. For example, consider the violacein pathway (five genes)[3] and an operon design—yielding 105 possible combinations for even a small library of 10 RBS. Varying
more components in the design (e.g., the promoter driving the operon
or adding degradation tags to the enzyme to control their expression)
expands the design space by further orders of magnitude and easily
reaches millions of potential combinations. Such large spaces then
become prohibitively expensive and difficult to investigate with standard
construction and analysis methods.It is desirable to gain as
much information as possible from a minimal number of experiments.[4] In recent years, design of experiments (DoE)
has become a popular method[5] and is used
for different applications to design and optimize synthetic biology
systems. DoE techniques are employed for protocol development such
as optimizing media and culture conditions for a maximum yield of
the metabolic pathway,[6,7] optimizing a transformation protocol,[8] or the optimization of a cell-free system.[9] DoE techniques have also been applied recently
to the optimization of metabolic pathways. Cis-regulatory elements
in pathways such as promoters and RBS can be selected to optimize
their outputs. Recent advances in metabolic engineering have made
use of DoE techniques for pathway simplification.[10] Others leverage the combination of DoE and high-throughput
automation platform by developing an automated design–build–test–learn
(DBTL) cycle[11] or utilizing full-featured
DNA foundry platforms.[12]The standard
theoretical textbook DoE workflow as described by
Gilman et al.[5] is grouped in three categories
of iterative experiments: scoping, screening, and optimization. The
scoping experiment is used to identify the region of interest. The
screening experiments identify the most significant factors, and finally,
the optimization stage consists of fine-tuning the factors to optimize
the response.In this work, we focus on the scoping phase of
the DoE cycle when
applied to the optimization of the production of a metabolite over
a large combinatorial design space of constructs. In particular, we
focus on the case when finding an adequate region of space to explore,
screen, and optimize over is difficult due to toxicity and burden
issues. To this end, we have developed a novel scoping approach, which
we called pooled approach, screening, identification, and visualization
(PASIV).
Lycopene Exemplar
For the development of the novel
approach, we have used a popular pathway in metabolic engineering:
the lycopene pathway. To create the necessarily challenging conditions,
we will consider its production under constitutive expression.Lycopene, a naturally produced bright red pigment, is a carotenoid
present in many plants and organisms[13] and
is of high bioeconomic interest for the pharmaceutical (thanks to
its antioxidant properties) and cosmetic and food industries (as a
coloring agent).[14,15]The demand in the industry
for carotenoids such as lycopene in
medical and pharmaceutical applications keeps increasing,[16−18] but its chemical synthesis is limited by high cost, low yield, and
quality.[19] Microbial lycopene production
is an alternative promising strategy and has been developed in various
hosts such as yeast,[20,21]Bacillus subtilis[22] or Yarrowia lipolytica,[23] and Escherichia coli.[16,24−26] The lycopene synthetic pathway (Figure A) comprises three successive
enzymatic reactions catalyzed by the enzymes crtE, crtB, and crtI.[27,28] Such simplicity, coupled with the fact that all of the associated
regulatory mechanisms have been characterized in several strains,
makes the lycopene pathway an appealing exemplar case study for optimization
problems in metabolic engineering. Lycopene production can also be
crippled by metabolic burden (strong production of lycopene and intermediaries
take away resources otherwise used for cellular growth)[29] and toxicity—lycopene having been proven
to be toxic to the cell[15,30] due to its accumulation
in the cellular membrane.
Figure 1
Three-gene lycopene pathway and its implementation
with an operon
design. (A) The lycopene pathway is made up of three enzymes (crtE,
crtB, and crtI). (B) The design includes these three genes arranged
in an operon pattern. All gene orders are permitted. A constitutive
promoter drives the operon, and an insulating RiboJ is inserted post
promoter.
Three-gene lycopene pathway and its implementation
with an operon
design. (A) The lycopene pathway is made up of three enzymes (crtE,
crtB, and crtI). (B) The design includes these three genes arranged
in an operon pattern. All gene orders are permitted. A constitutive
promoter drives the operon, and an insulating RiboJ is inserted post
promoter.The construct design used in this
study will be the design already
used in Exley et al.[31] in the context of
the use of well-characterized parts for DoE workflows. The construct
design (see Figure B) is based on the following principles:An operon design is used as it is a common design feature
in nature and in synthetic biology. It also reduces the design space
compared to a design where each gene has its own transcription unit
and reduces the chances of homologous recombination if promoters are
reused.A constitutive promoter drives
the operon. Five constitutive
promoters are chosen from the SynBIS library[32] to span a range from weak (K137085) to strong (Kelly’s reference
promoter J23101)[33] to very strong (BioFab
promoter apFab32)[34]—see Table S1 for their relative strengths.An insulating element RiboJ[35] is inserted between the promoter and first UTR-RBS as a
direct consequence
of using the BASIC assembly method and its library parts.All three RBS in the operon can be varied,
as per Blazeck
et al.[36] and Salis et al.[37] For each position, three RBS parts from a subset of the
Biolegio library (referred to as RBS1, 2, and 3 in the rest of this
paper) are used.Finally, the design
does not fix the gene order and
allows for permutation instead—thus permitting a reinvestigation
of previous studies such as Nishizaki et al.[38] with the carotenoid pathway in E. coli (and yeast), which demonstrated the influence of the said gene order.
In the rest of this article, constructs are labeled according to their
gene order (e.g., BEI is labeled 1)—see Table S2.These original choices
by Exley et al.[31] lead to a design space
of 810 promoter/RBS/gene order combinations
(6 permutations of the gene order, 5 promoters, and 3 RBS in 3 positions).
The size of the design space (810 constructs) makes this combinatorial
optimization problem an ideal test bed for their integration of automation
and software, as it is small enough to build all combinations using
standard modular plasmid construction methods such as BASIC assembly,[39] vehicle being large enough to employ DoE and
for any developed to be capable of scaling to larger, more complex
problems.
Failures and Failure Modes
The design space in our
work follows that of Exley et al.[31] As
a preliminary study for this work, the same initial DoE workflow as
Exley et al.[31] was implemented.An
initial set of 8 constructs out of the 88 random constructs selected
by DoE software in that study were assembled and transformed into E. coli. The transformants were plated and left for
growth in an attempt to isolate some colonies for a further culture
to extract and measure the produced lycopene. It was then observed
that the growth of the transformed colonies was either nonexistent
or very slow and the first colonies observed took as long as 72 h
to grow. To further investigate the reason for the observed growth
discrepancy, the colonies that grew were picked and sent for Sanger
sequencing. Two sets of issues were identified. Sequencing results
showed that many operon genes had either been deleted or mutated and
construct parts could not be identified—showing a propensity
of the cells to mutate or get rid of the operon. A significant number
of nonexpressing lycopene colonies were observed growing on top of
lycopene-expressing ones and outcompeting them on the Petri dish.These failures can be attributed to two biological factors: metabolic
burden (strong production of lycopene and intermediaries take away
resources otherwise used for cellular growth)[29] and toxicity—lycopene having been proven to be toxic to the
cell[15,30] due to its accumulation in the cellular
membrane. Practically, these failures are compounded by two sources
of technical failures—experimental failures and assembly failures.In the context of a DoE cycle, burden and toxicity are major obstacles
to an efficient, reliable, and reproducible investigation of the design
space; in the worst cases, they are stumbling blocks. In general,
burden leads to discrepancy in cellular growth, which is itself a
major hindrance to any effort to operate colony picking since colonies
cannot be picked within the same time frame—thus reducing the
throughput of the entire operation, whether conducted manually or
with automation. In titration assays, growth discrepancy is also a
significant source of variation in the data. Toxicity (and mutations)
will restrict the viable portion of the design space—more or
less severe depending on the capacity of the chassis to tolerate the
said toxicity.[40] Practically, cells containing
constructs lying outside the said viable region will fail to grow.Debugging individual failures—i.e., identifying the reason
for a failure to transform and grow a colony is, in general, cumbersome—for
instance, transformation failures cannot be observed immediately after
the act itself but only later in the process. Separating failure modes
at population levels is much easier, however, since they are distributed
very differently over the design space (as illustrated in Figure ):
Figure 2
Failure
modes affecting the scoping round. Three types of failures
are considered in this work. (A) Randomly distributed assembly failures.
(B) Assay failures (also randomly distributed). (C) Toxicity-induced
failures (highly correlated).
Assembly failures are sparsely occurring
events (Figure A)—common
modular assembly methods are very effective when short pathways are
assembled as is the case for the lycopene exemplar and can be assumed
independent from the construct (see the discussion on our solution
PASIV).Assay failures are uniformly
distributed: assay/experimental
failures are independent of the construct; their frequency depends
on the skills of the experimentalist (or the level of automation)
and the protocol adopted (Figure B).Toxicity and burden
issues are NOT independent of the
construct. The constructs encode a set of enzymes, and the recorded
output (viability, OD, concentration of a metabolite) depends on the
concentration of enzymes that are produced. When toxicity and burden
are too low to affect the constructs, the entire design space is,
a priori, viable (Figure C, left). Conversely, when toxicity and burden are issues—only
contiguous portions—the design space is viable (Figure C, right).Failure
modes affecting the scoping round. Three types of failures
are considered in this work. (A) Randomly distributed assembly failures.
(B) Assay failures (also randomly distributed). (C) Toxicity-induced
failures (highly correlated).
Bootstrapping the DoE Cycle
Typically, the scoping
phase of a DoE cycle consists of the random draw of a comparatively
small subset of the design space, to be followed by successive, targeted,
draws based on data analysis of the collected data, and finally an
optimization phase once a narrow-enough region of interest has been
identified. Figure A shows what can be expected at the end of the scoping study for
two extreme cases—when the whole design space is unaffected
by toxicity and burden issues (nontoxic case—Figure A, left) and when toxicity
and burden issues are severe but in a small region of the design space
(toxic case—Figure A, right). In the nontoxic case, it is possible to transform,
grow, and assay for the vast majority of the initial constructs—failures
can be solely attributed to technical issues—and enough data
can be collected to proceed to the targeted rounds. The scoping round
works as intended. Conversely, in the toxic case (for instance, the
exemplar lycopene study), a few of the constructs that are initially
drawn lie in this viable region. Failures may also be due to assembly
or experimental issues rather than burden or toxicity. Overall, only
a very few data can be collected at the end of this scoping phase—too
few to analyze and proceed to the later phases of the DoE cycle. The
scoping round fails.
Figure 3
Expected outcome to a scoping round in the nontoxic and
toxic cases
and for two different sets of input constructs. (A) A subset generated
by a random draw from the design space is used as input. (B) The whole
space is used. In the toxic case, colonies will only grow for constructs
located in the viable region. Only by using a dense input data set
will the viable region reveal itself by continuity—this simple
remark is the foundation of the PASIV method.
Expected outcome to a scoping round in the nontoxic and
toxic cases
and for two different sets of input constructs. (A) A subset generated
by a random draw from the design space is used as input. (B) The whole
space is used. In the toxic case, colonies will only grow for constructs
located in the viable region. Only by using a dense input data set
will the viable region reveal itself by continuity—this simple
remark is the foundation of the PASIV method.Figure B shows
the expected outcome of a scoping round if the whole design space
was used as input. In the nontoxic case, most colonies corresponding
to the input constructs will grow and yield data—as before
(the few) failures can be assigned to assay or assembly issues. In
the toxic case, colonies will only grow for constructs located in
the viable region. The viable region reveals itself by continuity
when successful assays are mapped in the construct space. Using the
whole space as input is of course totally impractical—and implementing
a brute force approach is antithetic to the idea of DoE. It is possible
and tractable, however, to randomly sample large sets using multiplex
construction in a pooled approach (PA). The high density of the input
data set will then still allow us to exploit the different distributions
of the failure modes and identify the continuous regions.We
present in the next section a novel workflow—coined PASIV—that
implements this idea and, through an interplay between biology (multiplex
construction in a pooled approach) and software (a range of bioinformatics
tools including a novel construct matching tool to overcome issues
inherent to pooled approaches), offer a novel way to perform the scoping
phase of the DoE cycle and identify the (viable) region of interest.
Results and Discussion
PASIV Approach
The acronym PASIV
stands for pooled
approach, screening, identification, and visualization—the
four distinct phases of the workflow. PASIV is also, as its name implies,
a passive method, as the viable region of the design space is allowed
to appear through the viability assay (unlike the more active standard
approach of drawing random samples and inferring the regions of interest
from them). The PASIV approach comprises four sequential phases (illustrated
in Figure ).
Figure 4
PASIV workflow—summary.
PASIV is made of four successive
phases. Phase 1, a construction step using modular assembly in a pooled
approach and aiming at building all constructs at once. Phase 2, a
viability assay consisting of a transformation and culture step followed
by a colony picking step (the assay may be followed by a titration
assay where metabolite production by the yet unidentified colonies
is measured). Phase 3, an identification step where the genetic content
of the picked colonies is identified. Phase 4, a visualization/analysis
step that combines all collected information.
PASIV workflow—summary.
PASIV is made of four successive
phases. Phase 1, a construction step using modular assembly in a pooled
approach and aiming at building all constructs at once. Phase 2, a
viability assay consisting of a transformation and culture step followed
by a colony picking step (the assay may be followed by a titration
assay where metabolite production by the yet unidentified colonies
is measured). Phase 3, an identification step where the genetic content
of the picked colonies is identified. Phase 4, a visualization/analysis
step that combines all collected information.
Phase
1: Construction with a Pooled Approach (PA)
Instead
of the targeted construction of a set of constructs (drawn at random),
multiplexed construction via a pooled approach is implemented. The
purpose of this stage is to build all possible constructs simultaneously
in a fast and effective manner. Practically, it is expected to generate
a large number of constructs from across the entire design space—with
no obvious bias (large gaps, different relative frequencies). A pooled
approach helps ensure that the same experimental conditions apply
to all of the constructs that are built, thus minimizing one of the
potential sources of errors previously listed.The BASIC assembly
method[39] is used for the modular assembly
of the constructs in a one-pot reaction, where all varying parts (promoters,
RBS, etc.) are mixed in equal quantities to achieve a purely combinatorial
assembly. When several gene orders are investigated, it is necessary
to operate several pools—each corresponding to a given gene
order. In the lycopene exemplar, six pools that correspond to the
six possible orders of gene combinations (crtE, crtB, and crtI) were
created. BASIC assembly relies on standard 21 base-pair overhangs
and 12 base-pair adapters for ligation and to drive assembly toward
the specified constructs. The overhangs are highly specific, and their
efficiency is assumed to be unaffected by the upstream and downstream
sequences (to the best of the authors’ knowledge, no results
have been published to indicate such an effect at the time of publication
of this paper).
Phase 2: A Viability Screening (S) Assay
The cells
are transformed with the genetic pool in another one-pot reaction.
The transformed cells are then plated and grown for the selection
of viable candidates. Viable candidates are then picked up from the
plate(s) and sent for sequencing. Selection should be at random and,
in particular, should not be affected by the apparent production (as
assessed by the color of the colonies)—ideally, all colonies
should be picked to reduce possible biases. A titration assay can
then be run with the viable candidates to collect data on metabolite
production. This second assay collects standard titration features
such as the optical density (either as a time series or as an end
point of assay) and the concentration of the metabolite of interest
and intermediaries if available (also either as a time series or as
an end point of assay).
Phase 3: An Identification (I) Phase
Since the cells
have been transformed with a mix containing all possible plasmids,
there is little prior knowledge of the genetic content of the picked
colonies—except that it matches a given template, was assembled
from a known set of components, and therefore lies in a, possibly
large, space of possibilities. There is also no certainty that all
constructs will feature among the picked colonies, while some may
be represented several times. This identification phase is crucial
to the overall success of the method—without it, no consequent
data visualization and analysis are possible. Identification is achieved
using a construct matching software called cMatch (see below) that
was developed as part of PASIV—and that is now routinely used
in-house for other applications such as quality control of the modular
assembly. cMatch analyzes sequences and identifies the closest member
of a design space corresponding to a given template and a set of component
libraries—or returns an error if the sequences are of poor
quality or too distant from the constructs in the design space.
Phase 4: A Visualization (V) Phase
Finally, the collected
data (colony content and assay measurements) are combined, analyzed,
and visualized. The goal of this stage is the localization of the
viable region—that is, the region least affected by toxicity
and therefore eligible for titer optimization.Within PASIV,
metabolite production is treated with a black-box approach. The constructs
in the design space are seen as an enzyme producer. Enzyme production
is itself viewed as a multidimensional stimulus to the chassis, which
is controlled by tuning construct components (promoter, RBS, etc.).
The bacterial chassis has its own limited resources, feedback systems,
including resistance to toxicity, and capacity for producing a metabolite
of interest. No attempt is made to model these processes; instead,
we rely on the data to estimate the dose–response relationship.
After data combination, all constructs in the design space are assigned:A multiplicity order: the number
of times the construct
has been identified among the picked colonies—this is assumed
strong evidence of whether the colony is viable and/or subject to
burden issues.The data collected during
the titration assay (if any).
To distinguish between the different sources of biological noise,
they are aggregated at two levels—by colony and for each colony
by assay sample (see the Results and Discussion section). Titration data can also be used to further restrict the
viable region.PASIV uses a minimal set
of coordinates corresponding to estimates
of the synthesis rates of all products encoded by the construct (the
enzymes crtE, crtB, and crtI in the case of the lycopene example).
These coordinates—called “modeling coordinates”
in the rest of this work as they are based on the modeling of protein
production—correspond to a representation of the stimulus amplitude.
Modeling coordinates can be used as universal coordinates for constructs
encoding the same functions (e.g., enzyme production)—and thus
grant comparisons between constructs with different gene orders or
based on different patterns (operon vs transcription units). They
also allow for the deployment of the most common machine learning
preprocessing techniques such as dimension reduction, distribution
normalization/standardization, and outlier correction if needed (for
instance, in high-dimensional problems). Crucially, metabolic burden
and toxicity effects are correlated to the amplitude of the stimulus—–the
higher the stimulus, the higher the burden and in the case of lycopene,
the more lycopene is produced, the higher the toxicity for the cells.
This makes such a set of coordinates ideal to differentiate toxicity
effects from construction and experimental issues.
Construct Identification
with cMatch
PASIV uses software
called cMatch that has been developed for the purpose of automating
quality control in (high-throughput) workflows encountered in combinatorial
pathway approaches to metabolic engineering. Thanks to the way it
performs quality control (detailed below), cMatch can be applied to
the more general problem of the identification of synthetic constructs
lying in a specified design space without having to perform a brute
force search for the best match in that design space.cMatch
deals with identification problems of plasmid-level complexity: constructs
are expected to include a few genes (10 at most), while sequences
to analyze will only be a few kbp-long (15 kbp at most). Furthermore,
information will not be evenly distributed in the sequences. The biological
functions encoded in the constructs will have sequences spanning several
orders of magnitude (CDS are 1 kb or more, promoters around a 100
bp, UTRs around 50 bp or less, while degradation tags are often less
than 10 bp long). At the same time, very short sequences (regulatory
components, as well as short adjacent sequences to functional components)
may massively affect the output to the construct[30,41,42] and will need to be identified with the
utmost precision—in contrast to applications such as barcoding,[43] where short barcode sequences (with a space
of possibilities in the billions) are used to identify constructs
in a design space of several thousands/millions.To deal with
such features and constraints, cMatch adopts a very
different strategy to recent annotation tools such as pLannotate,[44] which rely on BLAST[45] for matching and do not use any positional and combinatorial information.
First, cMatch uses the highly validated Smith–Waterman algorithm,[46,47] instead of BLAST,[48] which is better suited
to genome length sequences. This choice was born out of the need for
a high level of precision and the relatively short length of the sequences
to analyze. Second, cMatch heavily exploits positional and combinatorial
information—which also happen to be the only pieces of information
available on the genetic material when pooled construction workflows
are used, namely:The template,
describing their modular structure (component
types, relative order, interfacing constraints) and providing the
structural constraints for the space.For each component of the template, a library of admissible
elements—providing a description of the combinatorial constraints
for the space.Rather than using a brute
force approach to search the design space
for the constructs that best match an input sequence, cMatch performs
a three-level process. cMatch first searches for the individual components
listed in the template and then generates all admissible combinations.
Finally, the combinations are ranked. The matching (homology) scores
generated at both levels of the process are not only used to rank
the admissible combinations but also automate its decision-making
and quantify the reliability of the prediction. cMatch’s (core)
algorithm practically proceeds, as shown in Figure A:
Figure 5
Main features of the construct matching software cMatch.
(A) The
core algorithm cMatch used for construct matching is made of three
steps. Step 1: cMatch looks for all possible individual components
in the input sequence. Poor matches (low scores, in red) are pruned
out. Step 2: the list of all admissible constructs is generated. Dynamic
reconstruction was implemented. A construct matching score is generated
from the individual component matching scores for all possible constructs.
Step 3: constructs with the highest scores are returned. (B) cMatch
uses two types of inputs: a JSON template encoding the positional
and combinatorics constraints for a given search space and the sequence
data to analyze (single or multiple input). Single inputs are processed
with the core algorithm CM_1 detailed above; multiple inputs by an
extension CM_2. cMatch returns a JSON file listing the best match,
as well as a ranking of all admissible constructs.
Step 1—component matching: this looks for the
individual components in the input sequence—looping over all
libraries of components to match each of their elements to a subsequence
of the input sequence—assigning a matching score (normalized
by the length of the component) and a position (or several) in the
process. Pruning follows: only matches with a score above a user-specified
threshold are kept.Step 2—reconstruction
and pruning: the list of
all admissible constructs is generated. Rather than using a purely
combinatorial approach with a combinatorial product of the results
of the first step, followed by pruning against the template, an iterative
reconstruction has been implemented. Combinatorial recombination proves
extremely costly indeed when the input contains repetitions and elements
are detected in several locations—expanding the number of combinations
by several orders of magnitude. Instead, positional constraints are
applied as early as possible to prune out entire branches of the reconstruction
tree.Step 3—output: all admissible
constructs are
assigned a global score (the geometric mean of their component scores).
The combinations with the highest scores are returned as best matches.Main features of the construct matching software cMatch.
(A) The
core algorithm cMatch used for construct matching is made of three
steps. Step 1: cMatch looks for all possible individual components
in the input sequence. Poor matches (low scores, in red) are pruned
out. Step 2: the list of all admissible constructs is generated. Dynamic
reconstruction was implemented. A construct matching score is generated
from the individual component matching scores for all possible constructs.
Step 3: constructs with the highest scores are returned. (B) cMatch
uses two types of inputs: a JSON template encoding the positional
and combinatorics constraints for a given search space and the sequence
data to analyze (single or multiple input). Single inputs are processed
with the core algorithm CM_1 detailed above; multiple inputs by an
extension CM_2. cMatch returns a JSON file listing the best match,
as well as a ranking of all admissible constructs.Focusing on the individual components of the constructs and
then
on their combination is an efficient way to investigate large combinatorial
spaces and control combinatorial explosion, as it exploits the high
level of similarity among constructs in the same design space to flatten
the design space and reduce the number and complexity of the computations.
Even with the simple example of the lycopene operon, the power of
the approach is apparent. There are 810 admissible combinations in
the design space (101 250 with the entire Biolegio library),
but the 9 libraries only contain 18 elements (1 library of 5 promoters,
3 libraries with 3 RBS each, 3 libraries with 1 CDS each, and 1 library
with 1 terminator)—54 with the entire Biolegio library. Positional
constraints are powerful pieces of information in general. There are
9!/3! = 60 480 as many ways to arrange the components, i.e.,
almost 50 millions (and more than 5 billions with the entire Biolegio
library) when their order is not imposed. In the context of reconstruction,
the positional constraints can be applied very efficiently to accept/reject
combinations since the relative positions of the components can be
simply inferred from the positions of the components as returned during
the first phase (component matching).As it operates at both
component and construct levels, cMatch performs
a more thorough operation than simple sequence alignment. This operation,
called “construct matching”, returns (and quantifies)
a description of the modular structure of the construct, including:Structure matching: Does the construct
match the design?
Are all components present? Do their order and interfacing match the
construct design?Quality of the matching:
How close is the matching?
Are there any close constructs? Are any components affected by mutations?
Are there any insertions or deletions? Matching (homology) scores
are generated for all components and constructs.It is worth emphasizing that such a level of analysis is only
possible
thanks to the use of the template and knowledge of the elements in
the component libraries. Likewise, the identification of a construct
would also be intractable without these pieces of information.cMatch uses two types of inputs: a JSON template encoding the positional
and combinatorics constraints for a given search space and the sequence
data to analyze (Figure B). Sequence data can be either single-input, long-read data (as
with next-gen sequencing) or multiple-input, short-read data (typical
of Sanger sequencing, which was the case in this study). Two algorithms,
the core algorithm CM_1 detailed above and an extension CM_2 have
been developed—each tailored for a type of input (CM_1 for
single input, CM_2 for multiple input). cMatch returns a JSON file
listing the best match, as well as a ranking of all of the constructs
in the search space (or an error log), and all intermediary results.cMatch has been implemented in Python 3.9 and is publicly available
as an open-source package on the Kitney Lab Github page (https://github.com/kitneylab) under MIT license (https://choosealicense.com/licenses/mit/). The core functionalities are implemented as three different modules:
matching.py, reconstruction.py, and extension.py, which, respectively,
implement the core sequence, component libraries and component classes
and their matching methods (calling biopython pairwise2 local alignment
function), and the reconstruction and extension functions. All input
and output files are in JSON (for simplicity) except the sequence
files.
Application of PASIV to the Lycopene Exemplar
We now
illustrate how the general PASIV workflow can be used in practice
with the lycopene exemplar. Three of the four phases of the workflow
were customized as follows (the third phase, identification, is completely
independent of the application and need not be customized).
Phase 1:
Pooled Approach
The BASIC assembly method[39] is used for the modular assembly of the constructs
in a one-pot reaction. As previously mentioned, when several gene
orders are investigated, it is necessary to operate several pools—one
for each gene order. This is of course, primarily, to ensure no combination
lying outside the design space (constructs with multiple repeats of
some CDS and missing CDS) is constructed. In the context of lycopene,
where toxicity and burden are the most limiting factors, it is crucial
to the success of the methods as constructs missing one of the pathway
enzymes will exhibit a significant growth advantage (no lycopene made)
and will therefore be overwhelmingly represented at the screening
stage.Six different pools—each corresponding to one
of the six possible gene combinations for crtE, crtB, and crtI—were
created. For each of these pools:The other varying parts (promoters, RBS) are mixed in
equal quantities.The cells are transformed
with these genetic pools in
a one-pot reaction.The transformed cells
are plated and grown for the selection
of viable candidates.
Phase 2: Screening of the
Viable Candidates
Viable
and lycopene-expressing colonies are selected and picked. Practically,
the transformed cells that have grown on the Petri dish and that express
the characteristic orange color hue phenotype of lycopene expression[49] are selected and colonies are picked and grown
in liquid culture for 24 h for further analysis: lycopene extraction
and measurement, and sequencing. Lycopene is extracted with dimethyl
sulfoxide (DMSO) (see the Methods section)
from the liquid culture of each of the viable candidates. OD600 and absorption at 471 nm, the characteristic wavelength of lycopene,[50] are measured, and the lycopene yield is estimated.
Phase 4: Derivation of the Modeling Coordinates
In
the lycopene study, the production of each enzyme can be modeled with
the standard constitutive expression modelwhere K1 is the
effective transcription rate, K2 is the
effective translation rate, dm is the
effective mRNA degradation rate, and dE is the effective enzyme degradation rate. The synthesis rate is K1K2/dm, and the corresponding steady-state concentration is K1K2/dmdE.Since visualization,
clustering and other operations conducted in phase 4, can be conducted
up to a multiplicative constant, not all quantities need to be estimated
and those who do can themselves be estimated up to a multiplicative
constant.The degradation rates of the three enzymes crtE, crtI,
and crtB
are constants of the problem. Thanks to RiboJ’s protection
of mRNA at the 5′ end, the mRNA degradation rate can also be
assumed a constant of the problem—independent of the gene order,
albeit hard to estimate. It was therefore decided to use K1K2, which is the product
of the effective transcription rate and the effective translation
rate for the effective synthesis rate, instead of the exact K1K2/dm to represent the synthesis rate. All synthesis rates
used in the rest of this paper are henceforth expressed in arbitrary
units due to this simplification.With the lycopene operon used
in this study, the effective transcription
rate K1 for each gene is estimated from
the characterization data of the leading promoter (Table S1). All values are expressed in relative promoter unit
(RPU, a relative unit of strength common in promoter characterization,[33] which has been estimated at approximately 0.02
RNAP/s/promoter).[51] The use of the insulating
element RiboJ[35] ensures that transcription
and translation are decoupled from each other and these estimated
promoter strengths could be used for the first gene of the operon.
Effective transcription rates for the other positions were derived
by multiplying its strength by a factor depending on its position
down the operon, as suggested by Nishizaki[38] whose experimental results proved that “mRNA abundance decreased
by roughly half from one gene to the next”.The decoupling
of transcription and translation is an example of
favorable compositional context—unfortunately, no such simplification
is possible for the estimation of the effective translation rate.
Context is the set of interrelated factors that modulate the operation
of biological processes. These factors are traditionally grouped as
composition-specific, environment-specific, and host-specific.[52] In general, the expression of functional components
is affected by short adjacent upstream/downstream sequences.[41,42,53] Calculator tools such as the
RBS calculator (https://www.denovodna.com/)[37,54] now require 35 bp upstream and 60 bp downstream
to estimate the translation rate of an RBS.No attempt was made
at experimentally characterizing the RBS in
context. The effective translation rate K2 was estimated with the RBS calculator instead for all of the RBS
in the Biolegio library and for all possible contexts with the operon
design (upstream and downstream sequences from the flanking genes
or RiboJ—listed in the Supporting files Pool 1 - BEI; Pool 2 - BIE; Pool 3 - EBI; Pool 4 - EIB; Pool 5 - IBE; and Pool 6 - IEB). Corresponding estimations can be found
in the Supplementary Information and provide
a vivid illustration of the influence of context on RBS expression—not
only do translation rates can vary by as much as 3 orders of magnitude
indeed but the ranking orders between RBS in the library are also
not conserved when context changes. Such unpredictability contrasts
sharply with transcription, which varies with gene order in a predictable
manner by a factor of up to 4.
Preliminary Investigation
of the Design Space
The first
immediate application of the set of coordinates was to conduct a preliminary
investigation (Figure ) of the design space generated by the operon design and for the
entire BASIC Biolegio library and the chosen subset of {RBS1, RBS2,
RBS3}. For all visualizations in this paper, the synthesis rates (synthesis(crtE),
synthesis(crtB), synthesis(crtI)) are used as coordinates. We will
refer henceforth to this set of coordinates as “synthesis coordinates”
and use the shorthands E, B, and I for the enzyme names crtE, crtB, and crtI, respectively.
Figure 6
Design
space generated from the entire RBS Biolegio library and
the subset {RBS1, RBS2, RBS3}. Constructs are color-coded according
to their gene order/pool. (A) The design space generated from the
entire Biolegio library has 101 250 constructs and is heavily
concentrated by the E–B plane.
(B) The design space generated from the subset {RBS1, RBS, RBS3} only
includes 810 constructs ((B), left) and covers a region closer to
the origin. A zoom onto the origin ((B), right) shows a dense sampling
of this very weak region.
Design
space generated from the entire RBS Biolegio library and
the subset {RBS1, RBS2, RBS3}. Constructs are color-coded according
to their gene order/pool. (A) The design space generated from the
entire Biolegio library has 101 250 constructs and is heavily
concentrated by the E–B plane.
(B) The design space generated from the subset {RBS1, RBS, RBS3} only
includes 810 constructs ((B), left) and covers a region closer to
the origin. A zoom onto the origin ((B), right) shows a dense sampling
of this very weak region.Using the entire 15-strong Biolegio RBS library yields a design
space made of 101 250 constructs (5 promoters, 15 RBS in position
1, 15 RBS in position 2, 15 RBS in position 3, and 6 permutations
for the gene order). The corresponding design space (Figure A) is highly anisotropic. Constructs
exhibit a larger density close to the origin, along the E and B axes, and are also overwhelmingly located
close to the E–B plane—this
is expected as it corresponds to low crtI production and is a direct
consequence of the blocking effect the crtI coding region has on RBS
upstream of it. This was verified by conducting a codon optimization
of the crtI sequence (https://eu.idtdna.com/pages/tools/codon-optimization-tool) and comparing the estimated translation rates for both crtI and
its codon-optimized sequence. Results (see the Supporting Information) show an increase by an order of magnitude
with the codon-optimized sequence. A wide region by the origin ([0,25 000]
× [0,25 000] × [0,1000] in E–B–I coordinates) was densely sampled
by the design space, while regions further and further from the origin
were less and less so.Influence of the gene order can be glimpsed
thanks to the color
scheme used in the figure. Separate visualization of the constructs
by gene order (see the Supporting Information) shows that constructs tend to be located in different sections
of space depending on their gene order—suggesting a possible
reason why some gene orders are associated with better production
and yield: over-representation in production sweet spots. Separate
visualization of the constructs by promoters (Supporting Information) shows that the dense regions are the
results of not only the expression of the RBS in different contexts
and modulated transcription but also the weighting by promoters of
different strengths.Figure B shows
the 810-strong design space used in the constitutive study (5 promoters,
3 RBS in position 1, 3 RBS in position 2, 3 RBS in position 3, and
6 permutations for the gene order). The design space generated with
this subset is much sparser as it is reduced by a factor of more than
a 100 (810 instead of 101 325). It also exhibits a truncated
coverage as coordinates cannot go above 30 000 along the E-axis, and coordinates do not exceed 1000 along the I axis (Figure B, left). A zoom on the region close to the origin (Figure B, right), corresponding
to the weakest constructs, producing the enzymes in little quantity
and thus expected to produce little lycopene, shows that the region
remains densely (in the E–B plane) but only contains very low values along the I coordinates due to the blocking effect of crtI. Preliminary results
obtained while attempting to reproduce Exley et al.[31] have shown how difficult it is to obtain results even for
weak constructs; therefore, it was decided to stick to the original
{RBS1, RBS2, RBS3} subset and not try to extend the design space with
stronger (in context) RBS. Further post hoc justification for this
choice of the reduced RBS library was supplied by the experimental
results presented in the next section, showing that all constructs
yielded by the pooled approach were located well within this region
of weak constructs.
Screening Results
In phase 2 of
the PASIV workflow,
viable colonies are picked to be sequenced and cultured for a titration
assay. Colony picking was conducted irrespective of the color of the
colony. Only a few colonies were present—confirming previous
attempts at cultivating the E. coli for lycopene production and how difficult it is. Overall, 99 colonies
(from pools 1 to 6) were isolated (see Table S3). Pools 3 (EBI; 37/99) and 4 (EIB; 54/99) were overwhelmingly represented
among these isolates as these pools yielded the most colonies for
picking. Pools 1, 2, and 6 yielded very few viable colonies (all were
picked), while pool 5 yielded none.All picked colonies were
sent for Sanger sequencing (see the Methods section for details on the primers). Returned sequences were then
analyzed with cMatch to identify each of the components of the constructs
(promoters, RBS, CDS) and reconstruct them. Eighty-seven of these
constructs were successfully sequenced. Seventy-three out of the sequenced
87 isolates could be identified with sufficient reliability—that
is, they met the minimum quality requirements for construct matching
(each component could be matched with a sufficiently high score).
Most of the identified isolates proved to be duplicates, and, in the
end, only 23 constructs were unique. Such a low number was anticipated
since preliminary experimental results hinted at a very limited viable
region for the design space and toxicity and burden effects having
very noticeable effects on visible outcomes (cell viability and lycopene
production).Table lists these
23 distinct constructs, their components, and their multiplicity among
the 73 reconstructed sequences. Instead of listing construct components
in template order (promoter, RBS in position 1, CDS in position 1,
etc.), they are ordered in (promoter, RBS_E, RBS_B_RBS_I, gene order)
order, where RBS_E (respectively, RBS_B, RBS_I) is the RBS located
in front of the CDS for crtE (respectively, crtB, crtI). This was
done to make the results easier to compare across gene orders and
also to match the idea of modeling coordinates, which is independent
of the gene order. Results indicate that:
Table 1
Results of the Sequence Identification
Stepa
Constructs of higher
multiplicity
(4 and above) are colored red. The predominant pools, 3 and 4, are
colored blue and purple, respectively. Finally, the RBS placed before
crtE are also color-coded (green for RBS1, yellow for RBS2) to illustrate
the predominance of RBS1 in that position.
K137085 (the weakest promoter) is the most commonly
represented promoter (12/23) and is used in all of the constructs
with high multiplicity (4 or more; all colored in red). This is consistent
with a scenario where toxicity is the limiting factor.Conversely, apFab32 (the strongest promoter) is not
represented at all—again consistent with toxicity as a limiting
factor.The RBS placed before crtE is
almost always the weakest
RBS in context RBS1 (22/23, all colored in green), except in one case
(RBS2, the second weakest one, in yellow)—again consistent
with toxicity as a limiting factor.No
construct uses the same RBS in all three positions
at the same time. This observation contradicts projections by the
EFM calculator (https://barricklab.org/django/efm/)[55,56] that repetitions of the RBS only has a minor
effect on the overall RIP score (adding an extra 0.1), whereas the
choice of promoter has a significantly larger effect (K137085 and
its ATATATATATATATAT sequence being estimated to have add 50 points
to the RIP score). A possible explanation is that the weight used
in the EFM calculator corresponds to much more benign context than
the toxic context our cells are under and therefore underestimates
the possibility of recombination events when all RBS are identical.Pools 3 (EBI; 9/23; colored in purple) and
4 (EIB; 10/23;
in blue) remain the most dominant pools (see Table S4 for the aggregated statistics). They also contain all of
the higher-multiplicity constructs (hence the relative drop in prevalence
compared to the colony results)—hinting at a growth and/or
production advantage for some constructs in these pools.Constructs of higher
multiplicity
(4 and above) are colored red. The predominant pools, 3 and 4, are
colored blue and purple, respectively. Finally, the RBS placed before
crtE are also color-coded (green for RBS1, yellow for RBS2) to illustrate
the predominance of RBS1 in that position.
Location and Multiplicity
When plotted in E–B–I (synthesis)
coordinates, the identified isolates were all found to cluster close
to the origin (Figure A, left). All constructs inherited the low-I distribution
from the design space. Comparison with the whole design space (Figure A, right) illustrates
how concentrated the constructs are close to the origin, compared
to the available space sampled by the design space.
Figure 7
Location of the identified
isolates in the E–B–I coordinate system. (A) Identified
isolates (red dots) cluster very close to the origin—and only
lie in a small portion of the available design space (blue dots).
(B) Constructs are colored according to their multiplicity in the
sequence data set—from blue (low multiplicity) to red (high
multiplicity). A zoom on the region closest to the origin ((B), right)
shows that constructs with the highest multiplicity coincide with
the lowest available value of E.
Location of the identified
isolates in the E–B–I coordinate system. (A) Identified
isolates (red dots) cluster very close to the origin—and only
lie in a small portion of the available design space (blue dots).
(B) Constructs are colored according to their multiplicity in the
sequence data set—from blue (low multiplicity) to red (high
multiplicity). A zoom on the region closest to the origin ((B), right)
shows that constructs with the highest multiplicity coincide with
the lowest available value of E.When construct multiplicity is added to the plots (Figure B, left), it can be seen that
all higher-multiplicity constructs are not only located close to the
origin but also coincide with the lowest values for E-synthesis (Figure B, right)—illustrating results in Table S3, which show that these constructs used the weakest promoter
and the weakest RBS in front of E. These findings
are consistent with the toxicity explanation, where cells die due
to lycopene-induced toxicity (accumulation in the membrane), and therefore,
higher synthesis levels of enzymes lead to higher lycopene production
and accumulation in the membrane. In the E–B–I synthesis coordinates, all isolates
were located in a box [0,2500] × [0,1500] × [0,200]. Highest-multiplicity
isolates (exhibiting a growth advantage) were located in a smaller
box [0,100] × [0,200] × [0,25].
Analysis of the Titration
Assay Data
Lycopene production
and growth data were acquired as per the protocols described in the Methods section: optical density at 600 nm (standard
optical density) and 471 nm, to be, respectively, converted into dry
cell weight (DCW) and lycopene concentration. Corresponding yields
(in mg/g of dry cell weight) were finally computed, as it is the most
commonly used metric for metabolic performance.When using PASIV,
biological replicates are of two sorts. For each isolate that is picked
up, several biological repeats originating from that colony are run
as part of the titration assay (four in the exemplar study)—they
are referred to as “type 1” in this section. Only after
identification of their genetic material can the isolates, and associated
repeats, be grouped according to their genetic material (the construct)—type
2 repeats. For type 1 repeats, the biological samples are from the
same colony, and it is expected that measurements can be safely aggregated
and that the statistics derived from these operations will be indicative
of the performance of the colony (and the accuracy of the measuring
process). For type 2 repeats, however, variations will also be indicative
of the differences in terms of metabolic state for colonies with identical
genetic material. This distinction between types 1 and 2 is especially
relevant to the case of a constitutive design, placing significant
duress on the cells (especially for the higher-producing constructs)
and where assays cannot be synchronized as they would be with an inducible
design—and thus liable to result in significantly different
metabolic states for colonies containing identical genetic materials.Analysis of the measurements was therefore conducted in two steps
to separate sources of variation in the data (interisolate vs interconstruct)
and aggregate data safely. Only analysis of type 2 data (aggregated
by construct) is presented in this section; analysis of the type 1
data can be found in the Supporting Information.Table lists
the
results for the yield for all of the 23 distinct constructs. Three
statistics are used to identify the best performers in terms of yield:
mean of all of the replicates associated with the construct, maximum
over these replicates, and a more robust statistic presented below
to compensate for the possible variance in metabolic states. Both
mean yield and maximum yield identify a similar list of strong performers—the
discrepancies between both lists being caused by the multiplicity
of the construct. Average yields for the constructs with the highest
multiplicity (hence with the highest number of replicates) were often
significantly lower than their max value (by as much as 80%)—indicating
that genotypically identical constructs in the pool were liable to
feature in disparate metabolic states. At the other end of the spectrum,
construct C_01 with a multiplicity of only one returned similar values
of the mean and max. The best performers are, in all but two cases,
among the constructs with higher multiplicity (located in the cluster
closest to the origin). The best performer, C_22 (K137085, RBS1, RBS3,
RBS3), is also located in the cluster. This was surprising, as constructs
in the cluster (with some fitness advantage) were not expected to
be strong producers (an activity imposing duress). The best yields
corresponded to the strongest RBS (RBS2 and RBS3) placed in front
of the final enzyme of the pathway (crtI)—hinting that better
yields could be found with stronger RBS in front of crtI and that
our original choice of RBS was too restrictive (stronger RBS placed
in front of crtI should be tolerated and lead to better yields).
Table 2
Yield Results (in mg/g of Dry Cell
Weight) for All Unique Constructs Drawn with the Pooled Approacha
Constructs of higher multiplicity
(4 and above) are colored red. The best performers in terms of yield
are colored purple (max yield) and blue (mean yield of the replicates).
A new measure of yield is denoted as “new yield”—high
performers according to this metric are colored magenta.
Constructs of higher multiplicity
(4 and above) are colored red. The best performers in terms of yield
are colored purple (max yield) and blue (mean yield of the replicates).
A new measure of yield is denoted as “new yield”—high
performers according to this metric are colored magenta.To decouple interisolate and interconstruct
variation, the following
statistic was added. The statistic is computed in two stages. First,
for all isolates, the average is computed for all features (DCW, lycopene
concentration, and yield). These average statistics are considered
reliable indicators of the behavior of the isolate. Then, for constructs
of multiplicity larger than one, the isolate, the most representative
of the potential of a construct, is identified. The averaged features
of that isolate are then assigned to the construct. In the present
case, the isolate with the largest mean yield is selected as we are
interested in lycopene production (mean lycopene concentration could
also be used). Table shows the value of this new yield statistic (alongside the mean
and max) for all constructs. The new statistic does not penalize constructs
with higher multiplicity as much as the mean of all repeats. Conversely,
because it is based on some averaging, it does not reward outliers
as much as the maximum.The new statistic was used to separate
the constructs into high
and low producers. k-means clustering was used on
the 23-construct data set for two target clusters and using all three
dimensions (yield, concentration, and DCW). Figure A shows the location of both clusters in
EBI coordinates. Red dots are used for the strong performers; blue
dots are used for the weak ones. Red dots are all located in the cluster
closest to the origin and associated with higher multiplicity—confirming
that the cluster and nearby region should be investigated further
if one wishes to find good, reliable performers. A black dot is used
to label construct C_02, which has a high yield but is also associated
with low dry cell weight. As the construct is on its own, surrounded
by blue dots (low producers), it either corresponds to an outlier
(unlikely since it has a multiplicity of 3 in the pool) or is located
in a region of the metabolic landscape (high yield, low OD/DCW) that
is difficult to access via a pooled approach.
Figure 8
Separating the high and
low producers. (A) Location of the high-
and low-producer clusters in EBI coordinates. Red dots are used for
the strong performers; blue dots are used for the weak ones. Strong
performers are all located in the cluster closest to the origin and
associated with higher multiplicity. (B) Result of k-means clustering (n = 2) in the yield vs OD plane
(left) and the concentration vs OD plane (right). Red dots are used
for the strong performers; blue dots are used for the weak ones. Linear
regression was performed on both clusters in both cases.
Separating the high and
low producers. (A) Location of the high-
and low-producer clusters in EBI coordinates. Red dots are used for
the strong performers; blue dots are used for the weak ones. Strong
performers are all located in the cluster closest to the origin and
associated with higher multiplicity. (B) Result of k-means clustering (n = 2) in the yield vs OD plane
(left) and the concentration vs OD plane (right). Red dots are used
for the strong performers; blue dots are used for the weak ones. Linear
regression was performed on both clusters in both cases.Figure B
displays
the result of the clustering in the yield vs OD plane (left) and concentration
vs OD plane (right). Linear regression was performed on both clusters.
In the yield vs OD plane, the high performers (red) show a clear downward
trend (consistent with a classic growth–production trade-off),
while the trend is almost flat for lower producers. In the concentration
vs OD plane, no clear trend can be observed for the high performers
(red).
Identification of a Region for Further Investigation
All viable constructs were located in the [0,2500] × [0,1500]
× [0,200] box (in E–B–I coordinates), while the best performers
were in a much smaller box [0,100] × [0,200] × [0,25] (Figure ). The fact that
higher producers were very heavily clustered inside the region, while
lower producers were spread further from the origin, coupled with
our expectations regarding the effect of toxicity and burden on production,
and the results of the viability assay (very few viable colonies)
convinced us not to look for further regions of interest in the lycopene
case. Instead, it was decided to concentrate on the region containing
the most promising isolates, as it clearly was the region to favor
for the screening phase. In parallel, the larger region containing
all isolates was to be retained for a possible separate screening,
as it had been much more sparsely sampled.
Figure 9
Concluding the scoping
study. The higher-producing constructs feature
in the red box—close to the origin. To compensate for possible
errors in the process, the region is expanded—most noticeably
in the I direction—yielding the purple box,
which is considered the most promising and should be the subject of
further investigation.
Concluding the scoping
study. The higher-producing constructs feature
in the red box—close to the origin. To compensate for possible
errors in the process, the region is expanded—most noticeably
in the I direction—yielding the purple box,
which is considered the most promising and should be the subject of
further investigation.Another important feature
of the construct design is gene order.
Although a safe approach would be to consider all possible orders,
using fewer gene orders would simplify construction workflows. Practically,
it was decided it would be advantageous to only use pools 3, 4, and
6 (EIB, EBI, and IEB) sincePools 3 and 4 yield the most constructs and some of
the strongest performers.Pool 6 yields
fewer constructs but the best performers.Other pools yield very few constructs and weaker ones
than can be found in other pools.Finally,
as a way to account for possible errors in the process
and to make sure that as little as possible of the promising higher-producing
region is left out, the region was extended as follows:Along the B and E dimensions,
the maximum value was doubled.Yield
data for the stronger producers hint that it is
advantageous to have stronger RBS than RBS3 in front of crtI; the
maximum along the I dimension was therefore substantially
increased from 25 to 100.Analysis of
the data has identified two overlapping regions of
the design space: the blue box containing all of the identified constructs
and a red box containing the higher-producing region (in red). To
compensate for possible errors in the process, it was decided to expand
the higher-producing region (in red)—most noticeably in the I direction; the resulting region (in purple) is considered
the most promising and should be the subject of further investigation.
Discussion
The paper has presented a novel workflow to bootstrap
DoE cycles
for combinatorial optimization problems affected by severe toxicity
issues. More specifically, we have presented a novel workflow—which
we have called PASIV—to perform the scoping phase of the DoE
cycle and identify a suitable region of the design phase for the screening
and optimization phases to be conducted on (or subregions of).PASIV is based on a multiplex construction phase via a pooled approach
in conjunction with a viability assay. This experimental phase is
coupled to a software phase, where the isolated colonies are sequenced,
their genetic content (construct) is identified, to finally yield
an estimated location for one (or several) region(s) of interest.
We believe that this form of interplay between biology and software
methodology provides a flexible, time-efficient solution. Even at
the relatively modest scale of this study, identification of the viable
region of the design space would have been a cumbersome, lengthy,
and unreliable process without these tools and methods.Although
PASIV has been illustrated with lycopene (a very bright
pigment with a short pathway), it is worth emphasizing that the workflow
has not been developed for a broader range of applications. The modular
assembly method, BASIC, is also known to perform well for longer pathways—thus
extending the range of pathways PASIV can be applied to. The screening
phase is also not dependent on the presence of a visual reporter (be
it a colorful metabolite or through the addition of a dedicated sensor[57]). PASIV rests on a viability assay indeed—viable
colonies are to be picked regardless of the visual stimulus they emit.
The viability assay is followed by a titration assay, which is specific
to the metabolite of interest. For instance, in the lycopene exemplar
(see the Methods section), lycopene was first
extracted with[50] DMSO from the liquid culture
and its concentration was estimated from absorption reading at 471
nm.[50] Visual reporters could, in theory,
be used to steer toward high-producing regions in the picking phase,
but no attempts to do so were conducted in our work, as too few viable
colonies were present on the plates and the authors were concerned
with importing possible biases in the screening phase. As for the
software phases of the workflow, they are agnostic to the chosen pathway.In the lycopene exemplar, a homogeneous, contiguous region of space,
containing strong performers could be identified. The region, as could
be expected for a problem where toxicity plays a major role, corresponds
to the weakest constructs. Equivalently, it was located close to the
origin with the minimal set of coordinates (the enzyme synthesis rates)
used by PASIV. Although extrapolating from the data yielded with PASIV
should be done with suitable care, there are good reasons to consider
the identified region for further investigation.The identified
region had an enviable set of properties. Collected
data show a continuous, viable region of space that contains healthy
and productive colonies: some constructs indeed yield colonies with
high multiplicity, with a clear growth advantage and performing well.
In the context of metabolic engineering, these constructs would be
good choices as they are, most likely, reliable performers. Conversely,
constructs that fail the pooled approach, while possibly capable of
good production, are more likely to be subject to the toxicity effects
and be unreliable performers.Higher producers were also very
heavily clustered inside the region,
while lower producers were spread further from the origin. This fact,
coupled with our expectations regarding the effect of toxicity and
burden on production, and the results of the viability assay (very
few viable colonies) convinced us not to look for further regions
of interest in the lycopene case. Instead, the region containing the
most promising isolates was expanded (as shown in the Results and Discussion section and further discussed below)—as
it clearly was the region to favor for the screening phase. In parallel,
the larger region containing all isolates was retained for a possible
separate screening, as it contained more constructs but had been much
more sparsely sampled. Metabolic landscapes are in most cases not
as accommodating as the lycopene landscape. And it is, in general,
not possible to tell with certainty if the whole of the viable region
has been identified without several repeats of the workflow—especially
for more complex pathways where several viable regions may exist.
Even in such a case, advantages in using PASIV remain. The multiplex
construction remains fast and simple as all components are mixed at
once in a one-pot reaction—with identical experimental conditions.
More importantly, each new round of PASIV will yield constructs that
accumulate in the viable regions—which remains a better outcome
than the outcome of random sampling rounds (with a lot of failures).Turning the locations of the identified constructs into a region
of interest for further investigation means performing a practical
trade-off between efficiency (region should not be too large) and
safety (region should be large enough)—and involves design
decisions that are typical of a scoping study. On the one hand, preliminary
analysis of the data can lead to some forms of restriction on the
region of interest. In the lycopene exemplar, a reasonable case could
be made for not using all six gene orders but only three instead.
On the other hand, it is better to identify a region that is too large
than too small, lest the best-producing constructs are missed. PASIV
is inherently conservative: its goal is not to identify some optimal
region, but a region worthy of further investigation instead. Also,
due to the way the viability screening works, gaps between identified
constructs are to be expected and should be filled—the viability
assay will miss some viable constructs, even if they are represented
on the plates, as there are practical limits to the number of colonies
that can be picked. In the Results and Discussion section of this work, a very simple strategy was adopted: the E–B–I coordinates
were used to group all of the constructs that were identified into
a single box. An extension of that box was also performed—practically
the boundaries of the previous box were changed to give ourselves
a margin of error for the entire PASIV process.Although the
development of a rigorous statistical framework to
assess whether the viable region(s) have been identified already,
or if more rounds are needed, is beyond the scope of this work, some
elements of such a framework are worth discussing. Since the purpose
of PASIV is the safe identification of viable regions and is followed
by a targeted sampling of these regions, it is enough to ensure its
outcome remains stable over a range of iterations. We suggest using
an equivalent criterion, but more amenable to quantification, and
turn PASIV’s outcome into a simple predictive model (inside
the box(es): predicted to be viable, outside: predicted not to be
viable) and track its performances over a range of iterations of PASIV
by comparing its predictions against the data collected during a new
iteration. A stopping condition can easily be constructed from the
standard classification performance metrics (recall and F1 since the
scoping round aims to avoid false negatives). When there are several
subregions, clustering metrics (silhouette score, Davies–Bouldin
index, Calinski–Harabasz index) can also be used to quantify
the evolution of the clustering results. In parallel, as more data
are collected, estimation of the production landscape (mean, confidence
interval) can be refined, as can the low-producer/high-producer clustering.The benefits of conducting a proper titration assay in the second
phase, and not just a viability assay, extend beyond collecting data
that will be of use in the subsequent rounds of the DoE cycle: the
collected data can be useful in identifying the location of the region
of interest, as shown in the lycopene example. Clustering identified
two clusters (one of lower producers, one of higher producers)—the
stronger producers being located very close to the origin—and
corresponded to the lowest values of E-synthesis.
Also, the yield data hint that it is advantageous to have stronger
RBS than RBS3 in front of crtI and that regions with strong producers
should be expanded along the I dimension.A
benefit of using coordinates to define a region of interest is
to enable resampling of the said region to make it less sparse—this
would be of clear benefit to generate constructs even closer to the E-axis. In practice, it means identifying more constructs
using the specified operon design but outside the original design
space as they use new components. This can be achieved, for instance,
by extending the RBS libraries to include all of the members of the
Biolegio collection and drawing from the constructs that fall in the
newly identified region of interest. Likewise, more suitable constructs
can be generated by adjusting the promoter in the operon and using
weak promoters.Finally, another benefit of operating at the
region level (rather
than individual construct level) and adding margins of error to the
location of the region of interest is to be found in the final visualization
step in PASIV. The visualization phase relies on a minimal set of
coordinates related to enzyme synthesis and for these coordinates
to be estimated for each element in the design space. Transcription
values were derived from promoter characterization data in a similar
context. For translation, it was decided to rely solely on a bioinformatic
tool, the RBS calculator, to estimate the translation rates of all
of the RBS involved in the study and for all possible upstream and
downstream contexts. The decision was taken for several practical
reasons. First, the RBS calculator is a popular tool that is consistently
improved. Second, experimental characterization of the RBS in context
(to determine translation rates) was cumbersome:First, common reporters such as GFP cannot be used outright
since their first 60 bp differ from the crtE, crtI, and crtB enzymes.
Modifying the reporters at their 5′ ends was rejected as too
liable to modify the reporters’ properties. Using a fusion
protein was also rejected as it was liable to alter too much the metabolic
burden (and could lead to folding problems).His-tag characterization of protein concentration was
also investigated but protein purification considerations, concerns
about folding, and concerns about potential overexpression of the
recombinant proteins[58] led to its rejection.Finally, although there are other methods—most
notably targeted
proteomics[59,60]—that are well suited to
directly assess protein levels, they were rejected for the scoping
phase, as these modern capacities are not widely accessible yet, and
PASIV relies on growth information first (multiplicity and OD) and
production information second. These methods should of course be considered
for the subsequent targeted rounds of the DoE cycle.Although
it is tempting to use calculators, when possible, due
to their simplicity of use, it is important to bear in mind that their
use may lead to some features of some of the constructs being estimated
with a significant margin of error—making individual construct
predictions risky. While this is a potential issue in the screening
and optimization phases (protein measurements should then be conducted),
we argue that it is less of a problem in the scoping phase as conducted
in PASIV. PASIV’s purpose is the identification of a region
that is worthy of further investigation. PASIV is not so much concerned
with minimizing false positives (including constructs that should
not be) than maximizing true positives (making sure the region covers
the right area of the design space). This is reflected in the simple
manner the region of interest is reconstructed from the data (clustering
the identified isolates into boxes with added margins for errors).
Conclusions
The PASIV approach presented in this work offers a simple solution
to a particular issue in DoE cycles: what can be done if the scoping
phase is severely hampered by burden and toxicity issues. It is also
a showcase for a systematic approach to synthetic biology based on
a few principles:As far as
possible, the biology should be allowed to
behave naturally—the first two phases of PASIV harness biology,
so the viable region reveals itself.Reproducibility, reliability, and robustness are key.
In PASIV, they featured in the design of the workflow—to account
for possible failure modes—as well as the conservative approach
adopted to identify the viable region.Automation and software are to be deployed as much as
possible to operate at scale and make it possible to reliably collect
and process large amounts of data. PASIV would not work without its
software identification phase (phase 3).Synthetic biology should move further into the direction
of data science—and the biology be abstracted as much as possible.
The visualization phase (phase 4) is but a small example of such an
application.As a consequence of these
principles, several computational tools,
as well as a set of problem-specific experimental methods, have been
developed. Although the most obvious application area for these tools
and methods is metabolic engineering and large combinatorial optimization
problems, we are confident that they can be applied to a wider range
of problems.
Methods
BASIC DNA Assembly
BASIC linkers and part preparation
were done using the standard BASIC part preparation protocol (https://www.basic-assembly.org/protocols). BASIC assemblies were also done following the standard BASIC assembly
protocol in a 96-well standard plate either manually or with the Opentrons
OT-2.
Preparation of BASIC Linkers and BASIC Bioparts
BASIC
linkers were obtained from Biolegio. Synthesized genes were ordered
as gBlocks from NEB. New linkers are prepared as follows:Spin down the tubes
with lyophilized
linkers to ensure oligos are at the bottom of the tube.Set the heating block to 95 °C.Add 200 μL of the
linker annealing
buffer to each linker tube and leave it on the bench for 1 h.Vortex the tubes and collect
the liquid
at the bottom of the tube with a quick centrifuge spin.After the heating block reaches 95
°C, place the tubes into the block and slightly loosen the tube
caps to allow for heat expansion.After 5 min, switch off the heating
block and tighten the tube caps again to avoid evaporation.Allow the tubes to cool
down to room
temperature over at least 1 h in the heat block.Collect the solution at the bottom
of the tube with a quick centrifuge spin.Linkers are stored at −20 °C.
Bioparts for the BASIC assembly are provided in storage plasmids (pSEVA18).
For each BASIC linker ligation reaction, 50 ng of plasmid per 1 kb
of total plasmid size (including BASIC part and storage backbone pSEVA18)
is required. Usually, that amount of DNA is provided in 1 μL
of a typical miniprep of biopart storage plasmids (200 ng/μL
for a 4 kb plasmid). If polymerase chain reaction (PCR) products or
gene fragments are used as reaction inputs, 50 ng per 1 kb linear
DNA is required.
BASIC Reaction
For each BASIC linker ligation reaction,
one PCR tube with 30 μL total volume was set up: dH2O 17 μL, Promega T4 buffer (10×), 3 μL prefix linker,
1 μL suffix linker, 1 μL BASIC biopart 0.5–6 μL
(50 ng per 1 kb total plasmid size), dH2O was added to
reach 28.5 μL volume, NEB BsaI-HF v2 enzyme
(R3733) 20 U/μL, 1 μL Promega T4 ligase (M1801), 1–3
U/μL 0.5 μL mix by pipetting up and down. After mixing,
the tubes are placed in a PCR machine running the following program:
(37 °C, 2 min) × 20 cycles (20 °C, 1 min), (37 °C
5 min), (80 °C, 20 min).
Magbead Purification
This was done using 0.5 mL of
70% EtOH per BASIC reaction and bringing the magnetic beads stored
at 4 °C back into the homogeneous mix by shaking thoroughly.
We used 96-well Falcon plates (Falcon 351177) in combination with
an Ambion magnetic plate (AM10050) for quick magbead immobilization
and easy pipetting access. Purification protocol was applied as follows:Add 54 μL of
magnetic beads into
a 96-well Falcon plate (one well per BASIC reaction) and add 30 μL
BASIC linker ligation from the PCR machine step, mix by pipetting
10 times.Wait 5 min
to allow DNA binding to
magbeads.Place the
Falcon plate on the magnetic
stand and wait for rings to form and the solution to clear.Remove the solution with
a 200 μL
pipette tip from the center of each well.Add 190 μL 70% EtOH to each well
and wait for 30 s.Remove
the solution from each well
(pipette set to 200 μL volume).Add 190 μL 70% EtOH to each well
and wait for 30 s.Remove
the solution from each well
(pipette set to 200 μL volume).Leave the plate to dry for 1–2
min.Remove the Falcon
plate from the magnet
and resuspend magbeads in 32 μL dH2O.Wait for 1 min for DNA to elute.Place the Falcon plate
back on the
magnetic stand and allow the ring to form and the solution to clear.Pipette 30 μL of
H2O with eluted DNA into a fresh 1.5 mL Eppendorf tube
for direct use
in assembly or storage at −20 °C for up to 1 month.
Assembly Reaction
For each BASIC
assembly, parts were
combined with buffer in a PCR tube: dH2O 2 μL, NEB
CutSmart buffer 10 × 1 μL, linker ligated BASIC part crtE
1 μL, linker ligated BASIC part crtB 1 μL, Linker ligated
BASIC part crtI 1 μL, dH2O top up to 10 μL
total volume. Assembly reaction is run in a PCR machine with the following
program: 50 °C for 45 min followed by 4 °C on hold.
Transformation
Fifty microliters of chemically competent
cells DH5alpha with high transformation efficiency (109 CFU/μg pUC19, for instance, NEB C2987I) was used to transform
5 μL of each BASIC assembly:Chemically competent cells are stored
at −80 °C.Thaw the competent cells on ice (takes
5–10 min); 50 μL per BASIC assembly to be transformed.Cool 5 μL of the
BASIC DNA assembly
in a 1.5 mL Eppendorf tube on ice.Add 50 μL of competent cells
to each precooled 5 μL BASIC reaction.Incubate on ice for 20 min.Apply heat shock in a 42 °C water
bath for 45 s and place back on ice for 2 min.Add 200 μL of the SOC medium
to each tube and incubate shaking at 37 °C for 1 h recovery.Spot or plate cells on
agar plates
with the appropriate antibiotics. Depending on the number of parts
assembled and the transformation efficiency, 2–250 μL
might be spotted or plated.Incubate agar plates at 37 °C
overnight, and the next day, pick the colony for assay or miniprep.
We used a PCR machine, heat block (up to 95 °C) fitting 2 mL
Eppendorf tubes, water bath (42 °C) for transformation, magnetic
plate Ambion AM10050 (Thermo), 96-well U-bottom Falcon plate, Falcon
351177 (Thermo) Eppendorf tubes, magnetic beads Ampliclean (Nimagen),
dH2O, 70% EtOH, Biolegio BASIC linkers BBT-18100 (Biolegio),
BASIC parts in storage plasmids (200 ng/μL), NEB BsaI-HF v2 enzyme (R3733), 20 U/μL; includes CutSmart buffer R3733
(NEB), Promega T4 ligase (M1801) 1–3 U/μL; includes Promega
T4 buffer M1801 (Promega), chemically competent cells (DH5alpha, 1
× 109 CFU/μg pUC19); includes SOC media C2987I
(NEB).
Pooled Transformation
Instead of single RBS at a single
location, we use three RBS while maintaining a relative proportion
of RBS (relative to volume). The promoters, RBS, and genes were added
to the transformation mix. The order of the linkers in the BASIC assembly
allows the fixed ordering of the three genes (crtE, crtB, crtI), hence,
the six different pools containing six different gene order combinations.
Cloning and Cultivation
DH5α (New England Biolabs) E. coli was used for cloning with a standard approach. E. coli cells were grown in Luria Broth (LB) medium
supplemented with 25 μg/mL kanamycin (Kan). DH5α E. coli starter cells were grown at 37 °C. Cell
cultures for lycopene production were grown at 28 °C (optimum
growth temperature for lycopene production).
Extraction and Measurement
of Lycopene
The 24 h growth
plate has its OD600 measured on a plate reader (Clariostar).
Fifty microliters from the 1000 μL total volume is added to
150 μL of LB in a new, flat-bottom plate; it is mixed thoroughly
and then tested. The final OD value that the plate reader gives us
is multiplied by 4 to compensate for this dilution. The remaining
culture is spun at 4000 rpm for 15 min, at room temperature, until
pellets appear. The supernatant is removed by turning the plate upside-down
onto paper towels. The pellets are resuspended in 500 μL of
DMSO, using a robot (OT-2) or a multichannel pipette. The plates are
then incubated at 37 °C for 30 min at 900 rpm on a benchtop incubator,
to extract the lycopene, while covered in aluminum foil to limit exposure
to light. The plates are then spun down to generate pellets (15 min
at 4000 rpm). Two hundred microliters of the lycopene-containing DMSO
was moved to a flat-bottom plate to measure the OD at 471 nm (Clariostar).
Sequence Analysis and Part-Matching
To analyze the
Sanger sequences and identify the best candidate parts of the constructs,
we used the cMatch software. This allowed us to automate our workflow
and was sorely needed to procedurally match against the 810 possible
in silico constructs.We sequenced with three distinct reverse
primers starting at 1/3 in each of the three genes (crtB, crtI, and
crtE) (Figure ).
The three subsequences obtained are approximately 800 base-pair long,
which is long enough to let us determine the upstream parts. The multi-input
variant CM_2 of the cMatch algorithm was used.
Figure 10
Lycopene operon with
the three reverse sequencing primers for the
three genes crtE, crtI, and crtB.
Lycopene operon with
the three reverse sequencing primers for the
three genes crtE, crtI, and crtB.
Yield Calculation
Cell growth (E. coli DH5α) was followed by measuring the optical density at 600
nm and correlated to cell dry weight (CDW) with a ratio of CDW/OD
= 0.36.Ratio of cell dry weight (g/L)/OD600.
Authors: Marko Storch; Arturo Casini; Ben Mackrow; Toni Fleming; Harry Trewhitt; Tom Ellis; Geoff S Baldwin Journal: ACS Synth Biol Date: 2015-04-16 Impact factor: 5.110
Authors: Lauren B A Woodruff; Thomas E Gorochowski; Nicholas Roehner; Tarjei S Mikkelsen; Douglas Densmore; D Benjamin Gordon; Robert Nicol; Christopher A Voigt Journal: Nucleic Acids Res Date: 2017-02-17 Impact factor: 16.971
Authors: Jason R Kelly; Adam J Rubin; Joseph H Davis; Caroline M Ajo-Franklin; John Cumbers; Michael J Czar; Kim de Mora; Aaron L Glieberman; Dileep D Monie; Drew Endy Journal: J Biol Eng Date: 2009-03-20 Impact factor: 4.355