Literature DB >> 35704784

Machine Learning Guided Batched Design of a Bacterial Ribosome Binding Site.

Mengyan Zhang^1,2,3, Maciej Bartosz Holowko^4,5, Huw Hayman Zumpe^4,5, Cheng Soon Ong^1,2,3.

Abstract

Optimization of gene expression levels is an essential part of the organism design process. Fine control of this process can be achieved by engineering transcription and translation control elements, including the ribosome binding site (RBS). Unfortunately, the design of specific genetic parts remains challenging because of the lack of reliable design methods. To address this problem, we have created a machine learning guided Design-Build-Test-Learn (DBTL) cycle for the experimental design of bacterial RBSs to demonstrate how small genetic parts can be reliably designed using relatively small, high-quality data sets. We used Gaussian Process Regression for the Learn phase of the cycle and the Upper Confidence Bound multiarmed bandit algorithm for the Design of genetic variants to be tested in vivo. We have integrated these machine learning algorithms with laboratory automation and high-throughput processes for reliable data generation. Notably, by Testing a total of 450 RBS variants in four DBTL cycles, we have experimentally validated RBSs with high translation initiation rates equaling or exceeding our benchmark RBS by up to 34%. Overall, our results show that machine learning is a powerful tool for designing RBSs, and they pave the way toward more complicated genetic devices.

Entities: Chemical

Keywords: genetic part design; machine learning; optimization; ribosome binding site

Mesh：

Year: 2022 PMID： 35704784 PMCID： PMC9295160 DOI： 10.1021/acssynbio.2c00015

Source DB: PubMed Journal: ACS Synth Biol ISSN： 2161-5063 Impact factor: 5.249

Introduction

One of the main tenets of synthetic biology is the design, evaluation, and standardization of genetic parts.[1−3] A central challenge is part design, which is understood as modifying the sequence of a genetic part for it to meet specific requirements. Genetic parts are the units which are ultimately combined into more complex genetic circuits that produce desired functions in the target organisms. This is usually done in terms of the Design–Build–Test–Learn (DBTL) cycle, where a given genetic part or organism is continually improved through an iterative process. This cycle involves designing (D) new DNA sequences to achieve a desired property in computer-aided design software, then physically building (B) new constructed variants and testing (T) using an analytical instrument in a laboratory. Computer modeling can be used to learn (L) and predict the characteristics of a genetic part.[4,5] Most of these computer models are based on either the thermodynamic properties of the involved molecules (DNA, RNA, proteins, etc.) or empirically obtained values describing a relevant design property, like translation initiation rate (TIR) in the case of ribosome binding sites (RBSs).[6−8] However, de novo design of small genetic elements is challenging because of unknown relationships between their sequence and performance. In this paper, we propose a machine learning guided Design–Build–Test–Learn cycle for the experimental design of bacterial RBSs. This consists of two distinct types of machine learning methods, one in the Learn phase and a second in the Design phase. We show how small genetic parts can be reliably designed using even relatively small, but high-quality data sets. This work focuses on RBS part design and TIR prediction, rather than looking at its wider genetic context and impact on the general performance of the cell. As the RBS is one of the key genetic elements controlling protein expression in bacteria, it is a suitable target for establishing a workflow that could be potentially translated to more complicated systems. In the Design phase of the DBTL cycle, designers often fine-tune the characteristics of parts to give the resulting strains their desired properties. The ability to predict a characteristic of a genetic part (for example using a machine learning method) only provides inputs to the problem. The designer still needs to choose from the large number of possible variants to Build. For instance, to increase the yield of a protein, increasing the TIR of the RBS responsible for translation of that protein could be targeted. Hence, the goal of the Design phase of TIR is to select a small set of RBS sequences from the design space (i.e., all possible DNA sequences) based on the predictions from Learn phase. However, since the computational predictions are not perfectly accurate, there is uncertainty about which potential RBS sequence has the desired property (for example highest protein yield). The designer could use the mean predicted TIR values and choose the best ones to exploit the knowledge modeled by the computational predictions. But the designer may also want to explore regions of DNA design space where the predictions are highly uncertain (e.g., as measured by the predicted TIR standard deviation) and could yield high payoff. Therefore, a major challenge of the Design phase is to balance the exploitation and exploration, which is addressed by one class of machine learning approaches called multiarmed bandits.[9] As we will see in this paper, multiarmed bandits are well suited to solving the challenge of recommending a small set of RBS sequences in the Design phase of the DBTL cycle. One way to generate predicted TIR values is to use existing RBS calculators.[8] Three main RBS calculators were surveyed, all of which predict the TIR based on the thermodynamic properties of the RBS and the ribosome.[10−12] Predictions reported from all of these models range from relatively good (R2 > 0.8) to low (R2 < 0.2) depending on the data set.[13] This may be caused by the following: (i) they rely on calculations of free energies, which can be difficult to estimate with high precision, (ii) in general, one of the best ways to improve the models’ accuracy is by increasing the number of phenomena taken into account, which can lead paradoxically to decreased model accuracy due to accumulation of errors,[14] and (iii) by using deterministic coefficients to calculate energies, one disregards the often stochastic nature of processes in cells, which can potentially increase prediction error.[15] There is also evidence that binding energy calculations may be poor predictors of RBS strength.[16,17] This is reinforced by studies suggesting that RNA secondary structure is potentially a more important feature in TIR determination.[14,18] This suggests that multiple interactions determine the mRNA-ribosome binding, and predicting TIR from the genomic sequence is still challenging. Recent work has explored the use of machine learning predictors to learn from historical data and generate predictions for use in synthetic biology, vastly improving the DBTL cycle’s performance.[19−21] This work leverages the exponential increase in experimental data produced in synthetic biology[22] to improve predictors in the Learn phase of the DBTL cycle. For example, Jervis, Carbonell et al.[23] used support vector machines and neural networks to optimize production of monoterpenoids in Escherichia coli. Similarly,[24] others have used a number of machine learning approaches to analyze time-series multiomics data to predict metabolic pathway behavior. Deep learning techniques have also been successfully used to analyze large synthetic biology data sets.[25−27] Recall that the multiarmed bandit approach balances the exploration–exploitation trade off by using predictive uncertainty. The machine learning predictors mentioned, as well as existing RBS calculators, do not yet provide uncertainty levels to guide exploration with the notable exception of ART.[20] Furthermore, large data sets may not be available for a genetic part of interest, for instance the RBS. Hence there is a need to develop methods for training predictors on smaller data sets and generate predictions for both their mean values and uncertainties of the predictions in the Learn phase of DBTL. The predictions of characteristics of interest for each potential variant can then be used in the Design phase. Our overall experimental goal is to maximize the TIR by building and testing batches of RBS sequences with only a small number of DBTL cycle iterations. We demonstrate how machine learning (ML) algorithms can be incorporated into the DBTL cycle to predict (Learn) and recommend (Design) variants of a bacterial RBS with the goal of optimizing associated protein expression level in the specific genetic context (i.e., with specified bases upstream and downstream of the investigated RBS sequences). Our proposed machine learning guided DBTL cycle is summarized in Figure . Two types of ML algorithms are applied in the DBTL cycle. In the Learn phase, the goal is to train a predictor that will predict the protein expression level by learning from the logged data. We used a Bayesian, nonparametric regression algorithm, called Gaussian Process Regression (GPR)[28] in our pipeline. In addition to providing a predicted mean TIR, GPR also provides uncertainty estimates via its predicted standard deviation. GPR has been shown to perform well with small amounts of training data in biological prediction tasks.[28,29] The goal of the Design phase is to find a policy to recommend RBS sequences to query in batches so that we can identify the optimized RBS sequences within a given budget. We use a version of the Upper Confidence Bound multiarmed Bandit algorithms[29] to learn the policy. The policy uses the outputs from the GPR predictions to recommend useful batches for the Build and Test phases. We demonstrate that laboratory automation methods in the Build and Test phases result in high quality TIR data that is well suited for training machine learning methods in the subsequent Learn phase. The two types of algorithms cooperate with each other and provide powerful tools for DBTL cycle in genetic parts design.[29,30] Using our proposed machine learning guided DBTL cycle (Figure ), we were able to increase expression of our target protein by up to 35%, as compared to the very strong benchmark RBS.

Figure 1

Flowchart of the machine learning-based experimental design. The RBS design is recommended by the Upper Confidence Bound multiarmed Bandit algorithm. After generating the recommendations, the RBSs are built and tested using automated laboratory methods allowing for rapid construction and testing at scale. Finally, the obtained results are fed back to the Gaussian Process Regression prediction algorithm in the Learn phase. n is the current design round and k is the maximum number of rounds allowed by time and/or money. In regards to the “Goal Met?” condition, the goal in our case was to find sequences with TIR significantly higher than benchmark, but the goal can be generalized to fit the requirements of the user.

Results

We present our RBS-optimizing DBTL workflow that uses machine learning in Section . Machine learning is used in two different ways: (i) we show the efficacy of the ML recommendations in the Design stage (Section ), and (ii) we demonstrate that the ML predictions are accurate in the Learn stage (Section ). We present our new RBS sequence library in Section and describe some interesting characteristics of the discovered sequences, as well as show the effectiveness of the automated laboratory workflow.

The Experimental Workflow

Our DBTL workflow, which uses machine learning to optimize protein expression, is shown in Figure . The Build and Test phases are driven chiefly by choices made by human researchers and the use of automated methods. Machine learning algorithms are applied in Learn and Design. In the Learn phase, we use the Gaussian Process regression algorithm to predict the TIR of RBS sequences comprising the experimental space. In the Design phase, we use the Bandit algorithm to recommend new RBS sequences based on the predictions from Learn. The exact position of the RBS in the sequence upstream of the protein coding sequence (CDS) can be hard to pinpoint. However, most previous studies place the RBS in the 20 bp directly preceding the coding sequence. In our investigation, we are using an RBS that is known to have a very high TIR when expressing GFP and is present in the pBb series of plasmids.[31] This template RBS sequence is 20 base pairs (bp) long with the sequence TTTAAGAAGGAGATATACAT, where the highlighted nucleotides constitute the core of the RBS. As this is the sequence against which new RBS sequences will be benchmarked, we will refer to this sequence as the benchmark sequence hereafter. Additionally, we have experimentally confirmed that modifying the core sequence is statistically more impactful on TIR than changes made outside of it (see Figure S5). This hypothesis has been built based on reported biases toward certain bases present in the core of the RBS but absent outside of it. For example, according to ref (32) there is a strong bias toward A and G bases in the core region of the RBS. Similarly, outside of the 6 bases of the core in the wider 20 bp context of the RBS there is no significant bias toward any particular base, which suggests that these bases do not contribute to the overall TIR of a given RBS. Focusing on the 6 to 8 bp core sequence is a common RBS design approach.[33] Hence, in our design, we focus on modification of the core at nucleotide positions −8 to −13 (relative to the start codon of the GFP; this is where the consensus Shine–Dalgarno AGGAGG sequence is usually found in wild type E. coli) of the RBS and we keep other positions the same as the benchmark sequence, i.e., TTTAAGA + NNNNNN + TATACAT, where N can be any nucleotide (A, C, G, T). The total experimental (variant) space is then 46 = 4096. In our genetic design, the investigated RBS controls the expression of green fluorescent protein (GFP). By controlling expression of a fluorescent protein with the RBS we can quickly assess the perceived relative TIR by measuring fluorescence of cells harboring the expression vector over time. Finally, the mRNA is transcribed from an IPTG-inducible promoter pLlacO-1. Inducible expression allows synchronization of the start of the GFP expression in all the cultures by addition of IPTG. Since standardization and comparative studies should be done in as similar genetic contexts as possible, the design of this device has been deliberately kept simple to make such studies easier.[34]

Design: Performance of the Recommendation Algorithm

The Bandit recommendations were made using the batch Upper Confidence Bound multiarmed Bandit algorithm.[29] In short, this algorithm is a stochastic method of probing the experimental space, and is sometimes referred to as Bayesian optimization. It maximizes the reward (output) from testing a limited number of instances from a large pool that cannot be tested exhaustively because of limited resources (time, computational power, money). It balances the exploration–exploitation paradigm, where exploration focuses on testing data points that maximize information gain and exploitation focuses on recommending RBSs with high predicted TIR. Figure A shows the results for all the RBS groups tested experimentally. In each experimental round, in addition to the new RBS designs, we measure the TIR of the benchmark RBS as the internal standard. We then obtain the normalized TIR (called TIR ratio) by taking the ratio between the raw TIR of a new design and the average TIR of benchmark sequences in each round (which are run in triplicate in each round). Figure S6 shows these results in terms of raw TIRs.

Figure 2

TIRs of RBS groups examined in this study. (A) Swarm plot showing the obtained TIRs divided into RBS groups. BPS-NC: Base-by-base changes in the noncore region. BPS-C: Base-by-base changes in the core region. UNI: Randomly generated sequences with uniform distribution. PPM: Randomly generated sequences with distribution following the Position Probability Matrix for all natural RBS in E. coli. Bandit-0/1/2/3: Bandit algorithm generated results for Round 0, 1, 2, and 3 respectively. SD: Shine–Dalgarno sequence. Dashed line is set to 1 and represents the averaged benchmark sequence TIR for that group. BN: Benchmark sequences for all plates (not all are exactly 1 because of them being shown as separate samples rather than per round averages). (B) Line plot showing TIR obtained in a given quantile (Q) of results divided into groups as in (A). UNI and PPM are merged into Random group, and BPS-NC is not shown because of changes being made outside the core in that group. (C) Exploitation vs Exploration for Bandit 1–3. Blue-hued points represent exploitation, those hued red represent exploration. (D) Histogram with kernel density estimations (KDE) showing distributions of TIRs for Bandit groups. (E) t-SNE plot showing the relative distances between sequences in our design space as calculated by our kernel function (weighted degree kernel with shift). The Unlabeled points represent the RBS sequences in the design space that have not been tested. The area of the marker corresponds to the experimentally obtained TIR value. The TIR results in all subplots are shown normalized to the respective benchmark sequence sample, which acts as internal standard; the TIR of a given RBS is divided by the TIR of the benchmark RBS run in the same plate. To generate the data set from which the algorithm would learn, we decided to characterize a total of 450 RBS variants, a little over 10% of the whole experimental space. About a quarter of the designs are experimental controls to provide a baseline for comparison. To fit our automated workflow, we divided the 450 variants into batches of 90, split into 4 design rounds. In the zeroth round we tested two batches of designs, giving a total of 180 variants split as follows: BPS-NC and BPS-C group: 60 RBS sequences that are subsequent single nucleotide variations of all 20 nucleotides of the original, benchmark sequence. This batch is designed to show us the influence of such single nucleotide changes on the overall performance of the RBS and the potential impact of changes made beyond the core part (see Supplementary Figure S5). UNI group: 30 RBS sequences that were uniformly randomized, i.e., equal probability of choosing any nucleotide for each position. This group shows the performance of RBSs generated randomly. PPM group: 30 RBS sequences randomized based on the position probability matrix (PPM) generated from all the naturally occurring RBS sequences in the E. coli genome.[35] This group shows the performance of RBSs generated randomly, but following the natural nucleotide distribution. Bandit-0: 60 RBS sequences recommended by our implementation of the recommendation algorithm based on a data set obtained from literature,[36] which contains 113 nonrepeated records for 56 unique RBS sequences with their respective TIRs. This data set has been used because of the perceived similarity of its goal to that of this work—prediction of TIR based on phenotypic output. In the subsequent 3 rounds, with one batch each, all 90 designs were generated using our machine learning algorithm based on the data obtained from the previous rounds (these groups are called Bandit 1–3, respectively). All Round 0 groups (BPS-NC, BPS-C, UNI, PPM, Bandit-0) performed worse than our benchmark sequence in terms of TIR. The best-performing group was the BPS-NC, which is explained by the relatively small impact on the TIR of changes made outside the RBS core. The Bandit-0 group’s performance is similar to randomly generated designs, despite being machine learning-driven, because of being trained on approximate data. Starting from Round 1, where the prediction and recommendation algorithms were fed data from Round 0, the results improved significantly, with a number of sequences performing better than the consensus Shine–Dalgarno sequence and in one case, better than the benchmark (by 8%). In Round 2 we observed further improvement by obtaining more sequences with TIRs similar to our benchmark sequence. Finally, in Round 3 the algorithm identified two sequences that were 34% and 15% stronger than the benchmark sequence. In summary, 120 out of 450 sequences (BPS-NC, BPS-C, UNI, PPM) are experimental controls created by sequence randomization. Only a few of the random sequences were promising, but they were still 20% weaker than the benchmark sequence’s TIR (Figure A). In fact, these 80% TIR ratio sequences were created by randomizing the sequence outside of the core RBS region (BPS-NC), which was statistically shown not to be significantly impactful on the TIR (note, this applies only to the 7 bps on either side of the core; see Supplementary Figure S1 and work by ref (33)). Sequences from randomization of the core RBS (BPS-C, UNI, PPM) are more appropriate controls, from which the best sequence achieved only about 65% of the benchmark TIR. These results show that generating a strong RBS sequence by random mutation is a nontrivial task, when the tested data set is relatively small. In contrast, our Bandit-driven design gives much better results, with RBSs getting close to benchmark performance and even exceeding it.

Figure 3

Performance of the prediction algorithm (no kernel normalization). The scatter plots A–D show the performance of the prediction algorithm calculated after each round. Note that the TIR values are normalized according to the standardization described in Section , which is different from the TIR ratio reported in Figure . The x-axis and y-axis are, respectively, the true measured TIR and the predicted TIR. In (A–C), we show at round t = 1, 2, 3, respectively, we train our predictor based on the previously obtained data (round 0 to t – 1) and show the predictions on both the training data (orange) and recommendations suggested by the Design phase (i.e., test data, blue). In (D), the predictor was trained on a randomly chosen 80% subset of all available data, and tested on the remaining 20% data. The Spearman correlation coefficient (with corresponding p-value; calculated using test data only) are provided in each plot’s title on test data only. The p-value here is for the null hypothesis stating that two sets of data are uncorrelated, providing strong evidence that the predicted TIR are accurate. Figure B shows the same results but divided into quantiles where the specific point for a given group shows the highest TIR for that quantile. The gradual increase for all quantiles can be observed for all Bandit groups, suggesting that the algorithms have a better understanding of the experimental space given more data. The decreased result in the 0.9th quantile compared to the maximum value for the Bandit 3 group can be attributed to the increased emphasis on exploitation that has been set for that round compared to others. We see this effect in Figure C (with details shown in Supplementary A.6 and Figure S1), where we colored the data points for Bandit 1–3 groups according to their relative exploration–exploitation affinity. Those with a high predicted mean are colored blue and represent exploitation, those colored red are with high predicted uncertainty and represent exploration. The implication of the fact that an RBS sequence chosen at random will have low TIR (as shown by UNI and PPM) is that most of the exploration will result in low TIRs. These results confirm that RBSs with high TIRs tend to come from exploitation of the design space, whereas the exploration points give relatively low TIRs. Note that the exploration is necessary to expand our knowledge to the unknown parts of the design space and in effect allow us to exploit it better. Figure D shows the TIRs of RBSs tested in the Bandit groups divided into bins with width equal to a TIR ratio of 0.1. KDE plots have been overlaid to depict the calculated density for each group. The increase in prevalence of later Bandit groups in the higher bins is evident, especially for Bandit 2 and 3, constituting the bulk of results in the >0.8 TIR ratio bins. Notably, the distributions calculated for all the groups are bimodal—we discuss the possible reasons for that later in the text. In Figure E we show a t-distributed stochastic neighbor embedding (t-SNE)[37] plot depicting the experimental space. Each RBS is located on the plot according to its distance from other RBSs as calculated by our weighted degree kernel with shift (see Section ). The RBSs recommended by Bandit groups cover the majority of the design space. Additionally, a number of clusters were especially targeted by the recommendation algorithm. For example, the circled clusters labeled as “G-Rich Clusters” have been actively recommended by the algorithm. More specifically, sequences with 4 or more guanines in any position constituted 10% of the randomly selected sequences and 5, 9, 16, and finally 25% in each of the 4 Bandit guided batches, respectively.

Learn: Prediction of RBS Performance

In the Learn phase, we use a popular regression algorithm, Gaussian Process Regression (GPR). GPR has two benefits: (1) it provides the predicted TIR and the confidence interval needed for the UCB algorithm in the Design phase, and (2) it provides a natural way to calculate similarities between genomic sequences via kernels. Figure shows how our implementation of the Gaussian Process algorithm performed in terms of predictions (on a hold out test set) in each round. As expected, the use of approximate data led to poor predictions in Round 0. The predictions improved for the subsequent rounds, and the Spearman correlation coefficient rose from 0.269 for Round 0 to 0.546 for Round 3. We provide the evaluation of the Learn phase in terms of a ranking-based Spearman correlation coefficient, which has been shown to be a more suitable evaluation metric when the prediction is used for recommendation tasks than coefficient of determination (R2) or Pearson correlation coefficient.[38,39] The regression task in the DBTL cycle is more challenging than the large-scale data-based regression tasks. In the early iterations, we have a limited number of data points and relatively high variability due to the measurement noise. However, since the predictions are only used for recommendations in the Design phase, instead of precise predictions of the mean for each RBS, we only need to provide a valid ranking for both the predicted mean TIR and uncertainties, and thus the ranking of the UCB scores. This is because, in the Design part, we select RBSs based on the ranks of the TIR predictions instead of numerical predictions. This would ensure that for each round we would be testing the required designs, even if we are not exactly accurate in terms of their numerical characteristics. The performance of the Learn phase is also influenced by the exploration–exploitation balance in the Design phase. In each round, some data points were selected for exploration of the areas in which we have few or no tested data points. For those exploration points, since the predictor never learns their label distributions and patterns, there is no chance for the predictor to provide accurate predictions of TIR values, thus hurting the prediction performance in the short term. However, this is useful information for future predictions as it allows us to understand the whole underlying space instead of focusing on local suboptimal data points. In other words, we are intentionally sacrificing the accuracy of our predictions in each round to improve them in the future rounds. The effect of exploration of the space is the ability to find high TIR RBSs even with relatively low prediction performance.

Build and Test: Characteristics of the Tested Sequences

We present some important characteristics of the tested RBSs in Table , demonstrating the effectiveness of our Build and Test phases. Figure A shows the sequence logo calculated for the Top 30 sequences (Figure S9 shows the logo generated for all tested sequences). It is generally understood that guanine-rich sequences promote strong transcription. This expected bias toward guanine is clearly visible for all positions in our Top 30 RBSs. This result combined with the Bandit algorithm’s bias toward the G-rich cluster shown in Figure D reinforces the notion that our algorithm successfully identified G-rich sequences as the ones with higher probability of having a high TIR value.

Table 1

Characteristics of the Librarya

characteristics of the library	statistics
Total experimental space	4096
Planned constructs	450
Successfully constructed	445
Sequences with CV < 40%	79%
Sequences with CV < 20%	27%
Efficiency ratio of Bandit design (compared with random)	2
Raw TIR range	[4.93, 105.38]
TIR ratio range	[0.06, 1.34]

Figure 4

Characteristics of strong RBSs. (A) Sequence logo calculated for the Top 30 tested sequences. (B) Heatmap showing the edit (Hamming) distance required for positive change in TIR for RBSs with high and medium TIRs. The temperature scale shows the difference between a given RBS on the y-axis and the RBS with the strongest TIR at the given distance. Every second RBS is labeled for increased legibility.

Top table presents some of the characteristics of our library. Bottom table presents 10 RBS sequences with their corresponding TIR ratios; the first 9 are the strongest sequences including the benchmark sequence (AGGAGA) and the last is the Shine–Dalgarno sequence (AGGAGG). CV is coefficient of variation (standard deviation (STD) of a sample divided by its mean; see details in Figures S11 and S12). Efficiency of the Bandit design is calculated by dividing the highest TIR found using machine learning by the highest TIR found using random sequence generation. Characteristics of strong RBSs. (A) Sequence logo calculated for the Top 30 tested sequences. (B) Heatmap showing the edit (Hamming) distance required for positive change in TIR for RBSs with high and medium TIRs. The temperature scale shows the difference between a given RBS on the y-axis and the RBS with the strongest TIR at the given distance. Every second RBS is labeled for increased legibility. We discovered an interesting phenomenon of large edit distances between RBSs with high TIRs. In other words, given an RBS with high TIR, multiple sequence edits need to occur to reach another RBS with high TIR (we assume here that our data set includes most of the strongest RBSs for each distance–sequence combination). We define the edit distance as Hamming distance, that is, how many positions have to be changed to get from one sequence to the other (Hamming distance of 0 means that the sequences are identical and 6 means that they are two completely different sequences). Figure B shows the edit distance that is required for positive change in TIR for an RBS with TIR > 0.75. For RBSs with high TIRs (>1), the minimum distance that is required to increase the TIR is 2, with edit distances between 2 and 5 giving similar results. For RBSs with medium TIRs (<1), a distance of 1 is enough to produce a meaningful increase in TIR. This means that as the TIRs of examined RBSs increase, exploring sequences that are increasingly dissimilar to the current candidates tends to give more meaningful improvement. As long as this does not impact targeted methods like machine learning-guided design, this implies that the low rate of natural mutations will be very slow to explore more dissimilar sequences on such a short distance,[40] which indicates that methods like Adaptive Laboratory Evolution may not be able to find very strong RBSs with a limited budget. In other words, because the examined sequence is relatively short (6 bp in a wider 20 bp context) the time required to accumulate 2 or more changes in the RBS region required for meaningful increase in TIR might be prohibitively long (Figure A shows comparison between efficiency of random versus directed sequence generation). In such cases, a directed process, like the one described here, should be strongly encouraged. This observation is in line with approaches seen in other disciplines, e.g., protein engineering, where more directed changes yield better results than random changes.[41] Finally, while our strong sequences showed some affinity for the antisense sequence of the ribosome known to bind to RBS, they did not show any obvious secondary structures that could explain their TIRs (see Figure S10). This result combined with the unexpectedly bimodal nature of KDEs in Figure reinforces the notion, based on the previously reported literature,[14,16] that there may be a number of different mechanisms governing the probability of effective RBS-ribosome binding.

Discussion

In this work, we show how a machine learning guided approach and high-throughput, automated laboratory methods can be jointly applied to efficiently optimize a small genetic part, in this case maximizing the TIR of a bacterial RBS. In the Learn phase, we used Gaussian process regression to predict the TIR mean and uncertainty of that prediction. To represent RBSs and capture the similarities between them, we choose to use the Weighted Degree Kernel with Shift method, which fits well with Gaussian processes. In the Design phase we used an Upper Confidence Bound multiarmed Bandit algorithm to recommend sequences to be tested in batches. In the Build and Test phase, we performed our experiments using laboratory automation to increase their speed, reliability, and reproducibility. Using our proposed workflow and testing 450 RBS variants in 4 DBTL cycles, we designed and experimentally validated RBSs with high translation initiation rates equaling or exceeding the currently known strong RBSs in this genetic context by up to 34%. Furthermore, we have generated an extensive library of diverse RBSs that can be used as a basis for future studies. In the rest of the section, we first revisit our overall experimental goal and challenges encountered in each phase of the DBTL cycle (Figure ). We then link our proposed methods with related work and discuss the potential generalization of our framework. We further discuss our design choices and open questions in this line of research. We refer to ref (42) for a more detailed discussion. Our goal is to show the power of the machine learning guided DBTL cycle on RBS optimization. Our approach has shown that this combination of the two machine learning algorithms is able to correctly detect and exploit rules of biological design that otherwise require substantial time and experiments to uncover. We focus on the part-centric optimization,[27] which is an important task in synthetic biology. Understanding how to design the individual parts also allows us to extend the framework to better strain design,[23] where the part is optimized with a wider goal of strain optimization. The machine learning guided DBTL framework has good potential to be generalized to multigene pathway design as proposed by HamediRad, Chao et al.[43] and recently reviewed in ref (21). For example, the optimization goal can be adjusted to address combinatorial optimization for multigene and RBS scenarios; since this would be a large-scale data task, the current Gaussian Process regression prediction model could be updated to a deep Gaussian Process regression approach and the current Bandit algorithm could be optimized toward querying large design space to reduce computational complexity.[44] In the Design phase, we focus on maximizing the TIR of RBSs, by gradually moving our emphasis from exploration to exploitation as we progress through the design rounds. While maximization could be the appropriate payoff function for optimizing RBSs or other small parts, other payoffs may be better in more complex cases. For example, when considering multigene metabolic engineering, maximizing expression of individual genes may result in excessive metabolic burden, which could be achieved within the bandit framework by combining different goals into a multiobjective method.[45] We constrain our design space over 6-core parts of RBS sequences in this study; the design space will increase exponentially when designing a longer sequence. The computational complexity of calculating and sorting acquisition functions (e.g., UCB scores) can be reduced by discretizing the space adaptively and hierarchically.[46,47] In the Learn phase, our approach has correctly identified the correlation between high guanine content in the RBS and high TIR. We have achieved this despite the relatively low Spearman scores for our predictions. This observation of useful recommendations despite low prediction scores corroborates recent evidence from other studies.[20,48] Finally, our predictor Gaussian Process regression model, compared with previously described calculators using a deterministic thermodynamic approach, is able to show the uncertainty of the predictions, which can be used by our bandit algorithm to give better recommendations. To further test the performance of our predictions and recommendations, we have generated a TIR label for each of the 4096 RBS sequences using the RBS calculator[12] and used our approach to simulate the search for the strongest designs in limited number of rounds. We show these results in Supplementary Figure S3, but in short, our approach was able to identify above 75% of the sequences stronger than the benchmark in up to 2 rounds and 99% of them in up to 3 rounds. This surprisingly high efficiency can be partially attributed to the source of the TIR predictions, which being generated algorithmically are easier to predict by our algorithm than real world data. Nevertheless, it shows that the algorithm can efficiently learn the sequence to TIR relationship and exploit it subsequently in a recommendation task. In this study we have limited the number of design rounds to four. There were a number of reasons for this, including limitations on time and money, but also because the results obtained showed that we have achieved our goal of generating very strong RBS designs. There is a possibility that increasing the number of experimental rounds would enable us to improve the results further; however, this has to be put in the context of limited resources. For example, scanning the whole space would surely achieve the best results, i.e., would enable us to find the strongest possible RBS, but that would require unreasonable use of resources. Compared to solutions like the one reported by Hollerer,[27] our solution can be used when a high-volume method for data-generation is not available (for example, where there is no fluorescent readout available), while still providing the required results (optimized part). There are still open questions that need to be addressed for the application of machine learning in synthetic biology. First, we would like to understand how we can extract more biologically important information from the decisions made by our algorithms. We have shown that the algorithms are able to exploit them, but it will be important to create tools that will enable their reliable extraction from the results obtained. Second, given the small number of RBS sequences tested, how can machine learning algorithms provide more accurate predictions and uncertainty quantification? Third, the generalizability of the method is unknown. We believe that the method described here would be useful for designing other small genetic parts, but the complexity of the task quickly increases with the size of the analyzed sequence, so the method’s applicability might be impacted at some point. Similarly, the reusability of the obtained data set is currently unknown. The TIR of an RBS is dependent on its genetic context, but our Hamming distance and TIR impact of nucleotides outside of the core (Figures B and S5, respectively) studies indicate that as long as the changes in the genetic context are small, the obtained data set could serve as a basis for similar design efforts. For example, the data could be used to teach the algorithm for the first round of new designs, which in turn could decrease the number of rounds required to obtain the required characteristics. Finally, the practical optimal exploration–exploitation balance between rounds and samples is still an open question. In our work we have steered the recommendation process toward exploitation as we progressed through the rounds, in line with other works addressing this balance.[20] We have found our approach of bringing machine learning and synthetic biology experts together very fruitful. Synthetic biology is promoting standardized and normalized testing in biology and naturally pairs with machine learning, which can leverage the high quality biological data sets generated when the correct design rules are observed. The addition of machine learning to synthetic biology also adds an additional layer of scrutiny to the generated data sets through the advanced statistical methods that can be used to design and analyze the experiments. On top of that, the use of automation has helped us to produce more reliable results, which gave us the required confidence in our predictions and recommendations. We envision that pairing machine learning with high-throughput automation will keep delivering a high number of good quality data sets and improved methods for biological engineering. In the future we hope to extend the algorithm to other more complicated genetic elements, including promoters and terminators. However, it is important to reiterate that the experimental space grows exponentially with the number of examined positions, so the space becomes increasingly hard to cover with experiments, and so the percentage of the space that is measured will inevitably decrease, which might impact the expected recommendation results. To solve this problem, different algorithms or experimental techniques might be needed, but the general workflow can be reused.

Materials and Methods

Laboratory Experimental Design

Build: Construction of Genetic Devices

Plasmid Design

The pBbB6c-GFP plasmid was used for all our designs. This plasmid contains the GFP mut3b CDS, expression of which can be induced by the addition of isopropyl β-d-1-thiogalactopyranoside (IPTG) (Merck, Darmstadt, Germany, catalogue no. I5502). The original RBS for the GFP CDS was replaced using a combination of PCR and isothermal assembly. Primer sequences and the assembly strategy were generated using the Teselagen DESIGN software (Teselagen Biotechnology, San Francisco, CA).

PCR

PCR amplification of the cloning inserts was done using Q5 High-Fidelity 2X Master Mix (NEB, Ipswich, MA, catalogue no. M0492L). Twenty μL reactions were prepared by dispensing 1 μL of each 10 μM reverse primer into the wells of a 96-well PCR plate using the Echo liquid handler (Beckman Coulter, Brea, CA). A Mastermix consisting of polymerase premix, plasmid DNA template (pBbB6c, 5–10 ng per reaction), and the single 10 μM forward primer was prepared and dispensed using the FeliX liquid handler (Analytik Jena, Jena, Germany) or electronic multichannel pipet. Reactions were run using Touchdown PCR or standard PCR cycling methods in C1000 thermal cyclers (Bio-Rad, Hercules, CA). Capillary electrophoresis of PCR products was performed using the ZAG DNA Analyzer system (Agilent Technologies, Santa Clara, CA). Two μL of each PCR reaction was electrophoresed using the ZAG 130 dsDNA Kit (75–20 000 bp) or ZAG 110 dsDNA Kit (35–5000 bp) (Agilent Technologies, catalogue no. ZAG-110–5000; ZAG-130–5000). ZAG sample plates were prepared using the Sciclone G3 liquid handler (PerkinElmer, Waltham, MA). ProSize Data Analysis Software (Agilent Technologies) was used to generate gel images from the sample chromatograms, and amplicon sizes were estimated by reference to the upper and lower DNA markers spiked into each sample and a DNA ladder run in well H12 of each sample plate.

Isothermal DNA Assembly

Constructs were assembled using NEBuilder HiFi DNA Assembly Master Mix (NEB, catalogue no. E2621L). Reactions consisting of approximately equal amounts of the common fragment and the variable fragment were prepared using the FeliX liquid handler or electronic multichannel pipet, to a final volume of 5 or 10 μL. Assemblies were run in the thermal cycler for 1 h at 50 °C, followed by an infinite hold step at 4 °C. Finally, samples were incubated with of 50 nL of DpnI (NEB, catalogue no. R0176S) at 37 °C for 90 min to degrade any residual template DNA.

E. coli Transformation

The DH5α cell line (Thermo Fisher Scientific, Waltham, MA, catalogue no. 18265017) was made chemically competent using the Mix and Go E. coli Transformation Kit and Buffer Set (Zymo Research, Irvine, CA, catalogue no. T3001). Twenty μL of cells was aliquoted into each well of a cold 96-well PCR plate and stored at −80 °C for later use. Plates of cells were thawed on a −20 °C cold block before 3 μL of the assembly product was added and mixed using the FeliX liquid handler. Cells were incubated on a cold block for 2–5 min before being plated in a 96 (12 × 8) grid on Omnitrays containing LB (BD, Franklin Lakes, NJ, catalogue no. 244610) and 15 g/L agar (Merck, catalogue no. A1296) with 34 μg/mL chloramphenicol (Merck, catalogue no. C1919). Plates were incubated overnight at 37 °C. Cells were plated using the FeliX liquid handler.

Automated Colony Picking and Culturing

A PIXL colony picker (Singer Instruments, Roadwater, United Kingdom) was used to select individual colonies from the transformation plates using the 490–510 nm (cyan) light filter. Each selected colony was used to inoculate 1 mL of selective medium in a 2 mL square well 96 plate. They were then cultured overnight at 37 °C with shaking (300 rpm).

Glycerol Stock Preparation

100 μL of sterile 80% (v/v) glycerol (Chem-Supply, Gillman, Australia, catalogue no. GA010) and 100 μL of overnight culture were combined in the wells of a 96 deep (2 mL) round well plate using the FeliX liquid handler or electronic multichannel pipet They were then sealed with a 96-well silicone sealing mat and transferred to a −80 °C freezer.

Sequencing

Strains that gave GFP fluorescence intensity readings similar to that of the original RBS were selected for sequence confirmation by capillary electrophoresis sequencing (CES) (Macrogen, Inc., Seoul, South Korea). The strains transformed with each of the selected constructs were grown to saturation in 5 mL LB medium with chloramphenicol selection (34 μg/mL). Plasmids were extracted from the cultures using the QIAprep Spin Miniprep Kit (QIAGEN, Hilden, Germany, catalogue no. 27106) according to the manufacturer’s instructions. Plasmid concentrations were quantified using the Cytation 5 plate reader with the Take3Micro-Volume Plate (BioTek, Winooski, VT) and all fell in the range of 100–200 ng/μL. Samples of 20 μL of undiluted plasmid DNA were sequenced using a single primer (5′-CGATATAGGCGCCAGCAA-3′) that binds approximately 150 bp upstream of the RBS. Reads were aligned with the template sequence in the Teselagen software.

Test: Culture Analysis

Test Strain Culture

Overnight cultures (six biological (restarted every time from glycerol stock) replicates for each batch) were started by inoculating 1 mL of LB medium supplemented with 34 μg/mL chloramphenicol with 2 μL of the glycerol stock in a 96 deep (2 mL) round well plate, using the FeliX liquid handler or electronic multichannel pipet. Cultures were incubated at 37 °C with shaking (300 rpm) for 17 h. The following morning, 20 μL of each overnight culture was added to 980 μL of fresh selection medium, and these cultures were grown at 37 °C with shaking in a 96 deep (2 mL) round well plate. After 90 min, cultures were induced with IPTG to a final concentration of 0.5 mM. This was done by transferring 1.0 μL of 0.1 M IPTG to each well of a flat-bottom clear polystyrene 96-well plate using the Echo liquid handler, then adding 300 μL of culture to each well using an electronic multichannel pipet.

Microplate Spectrophotometry

The plates were monitored in the Cytation 5 microplate reader immediately after the addition of IPTG. Cytation 5 acquisition and incubation/shaking settings were as follows: length of run: 8 h; interval: 10 min; continuous orbital shake at 237 rpm and slow orbital speed; excitation wavelength: 490/10 nm; emission wavelength: 515/10 nm; bottom read; gain: 60; read height: 7 mm; read speed: Sweep.

Machine Learning Experimental Design

Two types of machine learning algorithms have to be applied to drive the experimental design workflow shown in Figure . One type of machine learning algorithm is a prediction algorithm (Learn), which helps us learn the function of TIR with respect to RBS sequence. The other type of machine learning algorithm is a recommendation algorithm (Design), which recommends RBS sequences to query (Test) in each round (in the sense of a single pass through the DBTL cycle) based on the predictions from Learn. In Round t, prediction and design are based on the results obtained in all previous rounds. Our implementation of the machine learning algorithms was tested in Python 3.6 and used the scikit-learn library.[49] In the following paragraphs, we describe our machine learning pipeline (which is applicable to design Rounds 1–3, denoted as Bandit 1–3) in the following order: data preprocessing, prediction and kernels, and finally recommendation. The pipeline for our zeroth round of machine learning designs, denoted as Bandit-0, is reported in Supplementary A.4.

Data Preprocessing

For each RBS sequence, we measured the TIR of 6 biological replicates. We measure GFP fluorescence of the culture in the log phase as it is the most representative one. We also measure the Optical Density at 600 nm (OD600) of the culture every 10 min, which reflects how dense (how many cells) the culture is. Denote our measuring time as t ∈ {tmin, ..., tmax}, where tmin is the start and end of the log growth phase. At tmin, the value is at its minimum. We choose the time range to be 4 h, i.e., tmax – tmin = 4 h. In our experiment, TIR is calculated as an averaged sum of GFP fluorescence divided by OD600 of the culture over time (calculated using values from all the measured data points), For label preprocessing, we first adjust the TIR values in each round using the round-wise reference values. The reference value is the TIR of the benchmark sequence that is run in triplicate in each round. Specifically, in Round t, we subtract the TIR mean of the benchmark RBS measured in Round t from all TIR values measured in the same round, for each replicate separately. We then normalized the data by performing a logarithm transformation and standardization on the adjusted TIR label for each replicate separately. After normalization, each replicate has zero mean and unit variance. Furthermore, we also normalized the kernel matrix used for prediction by centring and unit-norm normalization, which is reviewed in detail in Supplementary A.2.1.

Prediction: Gaussian Process Regression with String Kernel

To find RBS sequences with the highest possible TIR score after a total number of rounds N, we consider our experimental design problem as a sequential optimization of an unknown reward function , where is the set containing all RBS sequence points in the design space, and f(x) is the TIR score of the 6-base core sequence of the RBS x ∈ {A,C,G,T}6. In each Round t, we choose a set of m points and observe the function value at each point in the selected set , i.e., y = f(x) + ϵ, for all , where ϵ is the Gaussian random noise with unknown mean and standard deviation. For the regression model, we have used a Bayesian nonparametric approach called Gaussian Process Regression (GPR).[28,30,50] We model f as a sample from a Gaussian Process, which is specified by the mean function and the kernel function (also called covariance function) . GPR can predict the posterior mean of the RBS sequences in the design space. At the same time, it can provide the posterior standard deviation, which represents the level of uncertainty for the prediction. We review the predictions of GPR in Supplementary A.1. The choice of kernel function is critical for accurate predictions, since it controls smoothness and amplitude of the function to be modeled. For Bandit designs in Round 0, since we only had access to a limited number of data points from the literature, we chose to use one of the basic string kernels, the spectrum kernel(51) to process the core 6 bp and dot product kernel[28] (with one-hot embedding) to process the 7 bp flanking sequences both upstream and downstream of the core sequence. Among various string kernels, which have been studied in the literature,[52] and we will introduce briefly in Supplementary A.2, we chose the one with the best reported performance, the weighted degree kernel with shift (WDS),[52,53] to represent the RBS sequences in subsequent rounds and specify the kernel function of GP. WDS is a type of a string kernel, which takes two sequences as inputs and outputs a scalar value that represents the similarity between the two sequences. WDS kernel does this by counting the matches of substrings of a certain length (i.e., kmers) that constitute the sequence. The maximum substring length is specified by . The WDS takes into account the positional information by counting substrings starting from different positions, where the start position is specified by l. Additionally, the WDS kernel considers the shifting of substrings, with the maximum shift specified by s. This could be useful when there is a shift between two sequences. For example, two sequences ACCTGA and CCTGAA are in 1-shift. We now define WDS kernel. Let (A) be the indicator function, which equals 1 if A is true and 0 otherwise. Then indicates whether two substrings of length d match, between x starting from position l + s and x′ starting from position l. This is similarly done for . By having these two terms considering substrings of two sequences with starting positions differing by s characters, the WDS can measure shifted positional information. When s = 0, the kernel function counts the matches with no shift between sequences. Let x, x′ be two RBS sequences with length L; the WDS kernel is defined aswhere , and γ is a weighting parameter over the position in the sequence, where we chose to use a uniform weighting over the sequences, i.e., γ = 1/L (L = 20 in our design). S(l) determines the maximum shift at position l. Furthermore, we normalize the kernel with centring and unit-norm in terms of all sequences in the design space. The hyperparameters for kernel, including maximum substring length = 6, maximum shift length S(l) = 1, and the standard deviation of GP noise α = 1 were chosen on the basi of 10-repeat 5-fold cross validation. The values of hyperparameters considered were ∈ {3, 4, 5, 6}, S(l) = {0, 1, 2}, α = [0.5, 1, 2].

Recommendation: Upper Confidence Bound Multiarmed Bandit Algorithm

To recommend the RBS sequences to query in the next round, we have used the Upper Confidence Bound (UCB) batch algorithm,[9] which provides exploration–exploitation balance. On one hand, UCB exploits the function in terms of the design space, that is to pinpoint sequences that are believed to have high labels (i.e., high predicted mean); on the other hand, UCB explores the design space where we have little information and sequences have a chance to have high labels (i.e., high predicted standard deviation). More precisely, the UCB algorithm selects RBS sequences with the maximum upper confidence bound at Round t, i.e.,where β is a hyperparameter balancing the exploration and exploitation, μ(x), σ(x) are the predicted mean and standard deviation (that is uncertainty of the prediction) at Round t for the sequence x. We call μ(x) + βσ(x) the UCB score of sequence x at Round t – 1. Since experimentally labeling sequences is time-consuming, it is unrealistic to recommend sequences sequentially (i.e., one-by-one) and then wait for the label to be tested and used to improve the model. Instead, we can recommend RBS sequences in a batch of size m. However, this approach may end up recommending similar sequences in the same local maximum (e.g., x = 2, x = 2.5 in this example). Since GPR assumes similar sequences would have similar labels (e.g., by knowing x = 2 we can gain information about x = 2.5 as well), we prefer to not waste time and money on labeling sequences with high similarities in the same batch. To counter this, we use a batch recommendation strategy that is designed to avoid recommending such highly similar sequences in the same batch, described below. A key property of Gaussian Process regression is that the predicted standard deviation depends only on features, not on the labels. One can make use of this property to design a batch upper confidence bound (BUCB) algorithm.[29] The strategy here is to recommend RBS sequences one-by-one by sequentially adding the newly recommended RBSs’ predicted means (without testing them) into the training data and updating UCB scores. As illustrated in Figure B, the algorithm recommends the data point with maximum UCB score based on the predictions over the initial 5 observations. We then add the recommended data point (x = 2) into the training data set with the predicted mean of that point as the label (note it is not the true label), and update the predicted standard deviation, then we finally update the UCB scores. The second data point is then recommended based on the new UCB scores. Figure C shows that since we assume we have observed x = 2, the new predicted standard deviation of the data points in design space around x = 2 decreases, so instead of recommending a similar data point x = 2.5, the algorithm recommends x = 8, which is in another local maximum design area. In this way, the batch recommendation can potentially cover more local maximum areas than the sequential design. In summary, the exploration efficiency is improved since the recommended sequences in one batch will tend to be in different design areas so that the information gain is maximized. In our design, we have set the batch size to 90, to fit the experimental batch. Finally, we set β = 2 for Round 0–3 and β = 0 for the last round to allow for more exploitation.

Figure 5

Batch recommendation illustration. We use the batch size of 2, with 5 initial observations. The design space is 24 uniformly distributed points in the range [−2, 10], i.e., −2, −1.5, −1, ..., 9.5, 10. The confidence intervals are shown with predicted mean ±1.96 standard deviation. (A) Top UCB recommendations. The recommendations are 2 data points with top UCB scores, chosen with GP predictions. (B,C) Batch UCB recommendations. Panel (B) shows the first recommended sequence, (C) shows the new predicted confidence interval and the second recommendation based on that updated interval.

40 in total

1. The spectrum kernel: a string kernel for SVM protein classification.

Authors: Christina Leslie; Eleazar Eskin; William Stafford Noble
Journal: Pac Symp Biocomput Date: 2002

2. RBSDesigner: software for designing synthetic ribosome binding sites that yields a desired level of protein expression.

Authors: Dokyun Na; Doheon Lee
Journal: Bioinformatics Date: 2010-08-11 Impact factor: 6.937

3. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.

Authors: Babak Alipanahi; Andrew Delong; Matthew T Weirauch; Brendan J Frey
Journal: Nat Biotechnol Date: 2015-07-27 Impact factor: 54.908

4. An Automated Biomodel Selection System (BMSS) for Gene Circuit Designs.

Authors: Jing Wui Yeoh; Kai Boon Ivan Ng; Ai Ying Teh; JingYun Zhang; Wai Kit David Chee; Chueh Loo Poh
Journal: ACS Synth Biol Date: 2019-05-07 Impact factor: 5.110

5. Machine Learning of Designed Translational Control Allows Predictive Pathway Optimization in Escherichia coli.

Authors: Adrian J Jervis; Pablo Carbonell; Maria Vinaixa; Mark S Dunstan; Katherine A Hollywood; Christopher J Robinson; Nicholas J W Rattray; Cunyu Yan; Neil Swainston; Andrew Currin; Rehana Sung; Helen Toogood; Sandra Taylor; Jean-Loup Faulon; Rainer Breitling; Eriko Takano; Nigel S Scrutton
Journal: ACS Synth Biol Date: 2019-01-07 Impact factor: 5.110

Review 6. Machine learning for metabolic engineering: A review.

Authors: Christopher E Lawson; Jose Manuel Martí; Tijana Radivojevic; Sai Vamshi R Jonnalagadda; Reinhard Gentz; Nathan J Hillson; Sean Peisert; Joonhoon Kim; Blake A Simmons; Christopher J Petzold; Steven W Singer; Aindrila Mukhopadhyay; Deepti Tanjore; Joshua G Dunn; Hector Garcia Martin
Journal: Metab Eng Date: 2020-11-19 Impact factor: 9.783