
AutoESD: a web tool for automatic editing sequence design for genetic manipulation of microorganisms.

Yi Yang^1,2,3, Yufeng Mao^1,2, Ruoyu Wang^1,2, Haoran Li^1,2, Ye Liu^2, Haijiao Cheng^2, Zhenkun Shi^1,2, Yu Wang^2, Meng Wang^2, Ping Zheng^2, Xiaoping Liao^1,2, Hongwu Ma^1,2,3.

Abstract

Advances in genetic manipulation and genome engineering techniques have enabled on-demand targeted deletion, insertion, and substitution of DNA sequences. One important step in these techniques is the design of editing sequences (e.g. primers, homologous arms) to precisely target and manipulate DNA sequences of interest. Experimental biologists can employ multiple tools in a stepwise manner to assist editing sequence design (ESD), but this requires various software packages with non-standardized data exchange and input/output formats. Moreover, necessary quality control steps might be overlooked by non-expert users. This approach is low-throughput and error-prone, which illustrates the need for an automated ESD system. In this paper, we introduce AutoESD (https://autoesd.biodesign.ac.cn/), which designs editing sequences for all steps of genetic manipulation using many common screening-marker-based homologous-recombination techniques. Notably, multiple types of manipulations for different targets (CDS or intergenic region) can be processed in one submission. Moreover, AutoESD has an entirely cloud-based serverless architecture, offering high reliability, robustness and scalability, and is capable of processing hundreds of design tasks in parallel, each with thousands of targets, within minutes. To our knowledge, AutoESD is the first cloud platform enabling precise, automated, and high-throughput ESD across species, at any genomic locus for all manipulation types.


Year: 2022 PMID: 35639727 PMCID: PMC9252779 DOI: 10.1093/nar/gkac417

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 19.160

INTRODUCTION

Genetic engineering of organisms with desirable novel functionalities such as high productivity of valuable biochemicals is a key research objective of synthetic biology (1). Various genetic manipulation and genome engineering techniques have been developed for targeted deletion, insertion and substitution of DNA sequences in the genome of an existing organism (2,3). Among them, the scarless screening-marker-based homologous-recombination (HR) system and the RNA-guided CRISPR/Cas-based system are the most widely used (4,5). One common and important step in these techniques is the design of editing sequences to precisely target and edit DNA sequences of interest, such as the design of primers to locate and amplify the DNA fragment to be modified, or sgRNAs to guide the Cas enzyme to the exact site for cutting DNA in CRISPR/Cas-based systems. Editing sequence design (ESD) is not trivial and can be error-prone for a number of reasons. First, due to the existence of various technological variants suitable for different organisms with diverse parameter configurations (e.g. lengths of homologous arms, preferences for carriers of donor DNA), different researchers might design different editing sequences even for the same manipulation objective, potentially leading to failure. Second, quality control steps, such as primers for verification of HR events and off-target risk assessment, are often overlooked, possibly resulting in editing sequences with low efficiency. Third, certain complex steps (e.g. DNA assembly for heterogenous fragment insertion) might introduce mistakes if multiple interdependent pairs of primers are required. Several computer-aided design (CAD) tools have been developed for ESD (6–8). For example, PrecisePrimer (9) and Primer3 (10) can perform primer design for DNA fragment amplification. CHOPCHOP (11) and gBIG (12) were developed to design precise guide RNAs for CRISPR/Cas systems.
J5 (13) and Raven (14) can assist primer design for vector DNA assembly. Experimental biologists can employ multiple individual CAD tools in a stepwise manner to assist ESD, but this requires multiple software packages involving non-standardized data exchange and input/output formats. Instead, a one-stop service covering all the experimental steps is a better alternative and was shown to be effective with some recently developed tools. For example, GeneTargeter (15) and Yeastriction v0.1 (16) were developed for whole-process ESD in the case of gene knock-out using the CRISPR/Cas HR system in Plasmodium falciparum and Saccharomyces cerevisiae, respectively. GEDpm-cg (17) provides one-stop ESD for the construction of point mutations in the genome of Corynebacterium glutamicum based on two rounds of single crossover HR. However, these tools were developed explicitly for limited genetic manipulation types within particular species. Moreover, these tools can only choose genes as editing targets, limiting their application potential, since non-coding sequences such as promoters or ribosomal binding sites (RBSs) are also very important manipulation targets. To achieve target editing with any manipulation type (deletion, insertion or substitution) at any genomic locus across species, we chose the classical screening-marker-based HR (SMHR) system, which is a reliable genetic manipulation technique applicable in most microorganisms (5,18–20). For example, two rounds of single crossover HR mediated by a suicide plasmid is the most widely used HR system in C. glutamicum (20,21), while two rounds of cascade double/single crossover are preferred in Bacillus subtilis (5). For species with high transformation efficiency, such as Saccharomyces cerevisiae (18) and Escherichia coli (19), donor DNA is often provided in the form of PCR products to avoid the costs of plasmid construction.
Moreover, with the rapid development of automated strain engineering systems and emerging biofoundries (22–24), it has become physically feasible to construct thousands of strains within months or even weeks (25,26), increasing the demand for precise, automated and high-throughput design of editing sequences (7,27). In this paper, we introduce AutoESD (https://autoesd.biodesign.ac.cn/), the first cloud-based CAD tool for editing sequence design. Specifically, we developed flexible workflows to support various technological variants of SMHR. We integrated all editing sequence design steps covering the whole process of genetic manipulation using many techniques into a standardized one-stop service, resulting in an automated high-throughput in silico design platform. Moreover, AutoESD has an entirely cloud-based serverless architecture, offering high reliability, robustness and scalability. We believe that AutoESD combined with automated genetic engineering platforms will pave the way for an ultrafast design-build-test-learn synthetic biology cycle in the future.

RESULTS

Genetic manipulation principles and computational design workflows

In most microorganisms, the SMHR system is a reliable genetic manipulation technique in which two rounds of HR events occur with the help of homologous arms. The screening markers will be integrated into the genome in the first HR event for screening, then removed in the second HR event, enabling the scarless editing of the target genome (4,5). According to the type of HR events (single or double crossover) and the form in which the homologous arms are provided (plasmid, PCR fragment, or oligonucleotide), we included five major technical variants in this study, namely plasmid-mediated single/single crossover, plasmid-mediated double/single crossover, fragment-mediated double/single crossover, fragment-mediated double/double crossover and oligonucleotide-mediated double/double crossover HR (Figure 1A, Supplementary Figures S1–S4 and S6–S12).
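The five technical variants listed above can be captured in a small data structure. The following Python sketch is illustrative only: the class `HRVariant` and the helper `find_variant` are hypothetical names that do not come from the AutoESD code base, and the sketch merely encodes the combinations of donor form and crossover type named in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HRVariant:
    """One SMHR variant (hypothetical helper type, not the AutoESD API)."""
    donor: str      # form of the homologous arms: plasmid, fragment, oligonucleotide
    first_hr: str   # crossover type of the first HR event
    second_hr: str  # crossover type of the second HR event

    @property
    def label(self) -> str:
        return (f"{self.donor.capitalize()}-mediated "
                f"{self.first_hr}/{self.second_hr} crossover")


# The five variants named in the text.
VARIANTS = (
    HRVariant("plasmid", "single", "single"),
    HRVariant("plasmid", "double", "single"),
    HRVariant("fragment", "double", "single"),
    HRVariant("fragment", "double", "double"),
    HRVariant("oligonucleotide", "double", "double"),
)


def find_variant(donor: str, first_hr: str, second_hr: str) -> HRVariant:
    """Return the matching variant or raise ValueError for an unsupported combination."""
    wanted = HRVariant(donor.lower(), first_hr.lower(), second_hr.lower())
    if wanted not in VARIANTS:
        raise ValueError(f"unsupported combination: {wanted.label}")
    return wanted
```

For instance, `find_variant("Plasmid", "Double", "Single").label` gives the string "Plasmid-mediated double/single crossover", while a combination outside the five listed variants is rejected.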

Figure 1.

(A) Genetic manipulation principle; (B) computational design workflow of AutoESD.

Taking the plasmid-mediated double/single crossover as an example (Figure 1A), if a 500 bp fragment is to be inserted into the genome of Bacillus subtilis, the whole procedure can be divided into three major steps. First, we need to assemble the intermediate sequence, which contains the upstream homologous arm (UHA), inserted sequence, screening marker, and downstream homologous arm (DHA). This requires the design of primers for UHA (P1/P2), DHA (P3/P4), screening marker (P5/P6), and inserted sequence (P7/P8). Second, the screening marker and the inserted sequence will be integrated into the genome in the first HR event for screening, followed by the use of testing primers (TP1/TP2) to verify the first HR event and confirm that the sequences were actually integrated into the genome. Third, the fragment with screening markers will be removed from the edited genome, after which another pair of testing primers (TP3/TP4) is required to confirm the final editing result. In this study, we standardized the experimental process into a computational design workflow and coordinated functions corresponding to the dependencies in the experimental process as shown in Figure 1B:

Input: The workflow starts with an input handling module that processes the input file for target manipulation in a format similar to VCF (28), enabling users to integrate different types of sequence manipulations in a single file. Within this file, ‘Sequence upstream of the manipulation site (>100bp)’ rather than exact position or gene ID is used to locate the manipulation site in the given genome, which has two advantages. First, for non-standard genomes (e.g. a laboratory evolved strain) or unsequenced genomes, it is difficult to determine the position, but it is often easy to find the sequences around the site. Second, using the sequence instead of gene ID allows us to manipulate at any site, including noncoding regions such as promoters as well as coding sequences. Similarly, we also use sequences in the input file to indicate reference sequences and inserted sequences. An empty inserted sequence (represented as ‘-’ in the input file) implies a deletion, and an empty reference sequence implies an insertion. In this way, we can include different types of sequence manipulations (deletion, insertion or substitution) in a single input file. Moreover, for some cases, such as deletion of a gene's complete CDS, it is easy to obtain the exact chromosomal position from the genome annotation file (e.g. in GFF or GTF format). AutoESD therefore also supports input files which include the exact position of the manipulation site (a combination of chromosome, position and strand).

Assembly: Four primer design modules are needed for designing P1–P8 to obtain the assembled plasmid (Figure 1A). There are interdependencies between the primers (e.g. 20 bp overlaps between P2 and P5, 20 bp overlaps between P3 and P6). The major difference among these technical variants is that the required modules/parameters are not the same and the interdependencies between modules are slightly different (Figure 1B), making the design quite difficult for non-expert users. For example, the screening marker primer design is unnecessary for plasmid-mediated single/single crossover, while the UHA/DHA primer design modules are not required for oligonucleotide-mediated double/double crossover.

First HR for integration: This step integrates the foreign sequences into the target site in the genome. One of two technical variants, single crossover or double crossover, can be chosen according to its suitability for the chassis organism. Testing primers (TP1/TP2) need to be designed to verify the first HR event and confirm that the sequences were actually integrated into the genome. This step is often overlooked by inexperienced biologists, leading to failure at a later stage and a waste of time.

Second HR for marker removal: This step removes the screening marker sequences from the edited genome for a scarless genome manipulation. Similar to the first HR, single or double crossover can be chosen. However, the options may be restricted by the first HR event. Another pair of testing primers (TP3/TP4) needs to be designed to confirm the final editing result.

After designing all the editing sequences, off-target evaluation of the designed homologous arms is performed to help end-users assess the potential risk and adjust parameter settings if needed. In addition, detailed visualization of editing sequences is provided for the convenience of end-users doing quality checks online before placing the primer order. Although there are many differences in terms of technical variants, the modularized algorithm design approach allows us to assemble logic functions like LEGO bricks, generating a similar overall workflow (Figure 1B). By designing the underlying data structure and modules, we construct separate functions for all modules. Notably, one single function is created for each module to support all five different technical variants through different parameter settings. In this way, the whole workflow enables on-demand one-stop editing sequence design across technical variants, supporting various genetic manipulations across species. The core logical code of AutoESD is available on GitHub (https://github.com/tibbdc/autoesd).
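The sequence-based input convention described above (an upstream sequence to locate the site, plus a reference and an inserted sequence, with ‘-’ marking an empty field) can be illustrated with a minimal Python sketch. The names `Manipulation`, `classify`, `locate_site` and `apply_edit` are hypothetical and are not taken from the AutoESD code. The sketch also makes simplifying assumptions: it searches the forward strand only, treats both ‘-’ and an empty string as an empty field, requires a unique exact match, and does not enforce the >100 bp length of the upstream sequence.

```python
from dataclasses import dataclass


@dataclass
class Manipulation:
    upstream: str   # sequence upstream of the manipulation site
    reference: str  # sequence to be replaced ("-" or "" if none)
    inserted: str   # sequence to be introduced ("-" or "" if none)


def _clean(seq: str) -> str:
    seq = seq.strip().upper()
    return "" if seq == "-" else seq


def classify(m: Manipulation) -> str:
    """Infer the manipulation type from which fields are empty."""
    ref, ins = _clean(m.reference), _clean(m.inserted)
    if ref and ins:
        return "substitution"
    if ref:
        return "deletion"   # empty inserted sequence
    if ins:
        return "insertion"  # empty reference sequence
    raise ValueError("reference and inserted sequences are both empty")


def locate_site(genome: str, m: Manipulation) -> int:
    """Return the 0-based index of the first base after the upstream sequence."""
    genome, up = genome.upper(), _clean(m.upstream)
    if not up:
        raise ValueError("upstream sequence is required")
    first = genome.find(up)
    if first == -1:
        raise ValueError("upstream sequence not found")
    if genome.find(up, first + 1) != -1:
        raise ValueError("upstream sequence is not unique")
    site = first + len(up)
    if not genome.startswith(_clean(m.reference), site):
        raise ValueError("reference sequence does not follow the upstream sequence")
    return site


def apply_edit(genome: str, m: Manipulation) -> str:
    """Return the edited genome sequence after a scarless manipulation."""
    site = locate_site(genome, m)
    ref, ins = _clean(m.reference), _clean(m.inserted)
    return genome.upper()[:site] + ins + genome.upper()[site + len(ref):]
```

With the toy genome `AAACCCGGGTTT`, the record `Manipulation("AAACCC", "GGG", "-")` is classified as a deletion and `apply_edit` returns `AAACCCTTT`; swapping the last two fields yields an insertion instead.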

Cloud platform implementation based on Amazon web services

Based on the workflow shown in Figure 1B, we have written several functions in python for different design modules and built a user-friendly web server for the complete design workflow using cloud-based serverless architecture, named AutoESD. This enables biologists without coding experience to use it for reliable, robust, and automatic editing sequence design. We used a three-tier architecture (the front presentation tier, logic computation tier, and data storage tier), as shown in Figure 2, to build our web server on Amazon Web Services (AWS). The data storage tier manages persistent storage of our platform, including AWS DynamoDB and AWS S3 for storing user-uploaded input files, parameters and jobs. The front presentation tier represents the components users directly interact with, which is hosted by the AWS S3 static website functionality and accelerated by AWS CloudFront. The seqviz package (https://github.com/Lattice-Automation/seqviz) was used to provide customizable visualization of design results, which can help end-users check the design results online. The logic computation tier manages requests from external systems and contains the core services such as AWS Lambda, AWS API Gateway and AWS Step Functions. The API Gateway handles the HTTP requests and routes them to the correct backend. AWS Step Functions orchestrate the serverless workflow by processing messages from the API Gateway and invoking AWS Lambda asynchronously. AWS Lambda provides core computation functionality with main functions to handle different requests. More specifically, the Lambda function gets the corresponding input files and parameters from AWS S3, and calls the workflow function in python to obtain the design results, which will be uploaded to AWS S3 as well.
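The Lambda step described above (fetch input and parameters from S3, run the Python workflow, upload the results to S3) might look roughly like the following sketch. This is not the AutoESD implementation: the event fields (`bucket`, `input_key`, `params`), the result key naming and the placeholder `design_workflow` function are all assumptions made for illustration. Only the boto3 calls `get_object` and `put_object` are real S3 client methods. The S3 client is passed in as an optional argument so that the handler can be exercised locally without AWS credentials.

```python
import json


def design_workflow(input_text: str, params: dict) -> dict:
    """Placeholder for the real design workflow (hypothetical)."""
    targets = [line for line in input_text.splitlines() if line.strip()]
    return {"n_targets": len(targets), "params": params}


def handler(event, context=None, s3=None):
    """Lambda-style entry point: read input from S3, design, write results to S3."""
    if s3 is None:
        import boto3  # only needed when running on AWS
        s3 = boto3.client("s3")

    bucket = event["bucket"]          # assumed event layout
    input_key = event["input_key"]

    # Download the user-uploaded target manipulation file.
    obj = s3.get_object(Bucket=bucket, Key=input_key)
    input_text = obj["Body"].read().decode("utf-8")

    result = design_workflow(input_text, event.get("params", {}))

    # Upload the design results next to the input file.
    result_key = input_key + ".result.json"
    s3.put_object(Bucket=bucket, Key=result_key,
                  Body=json.dumps(result).encode("utf-8"))
    return {"status": "done", "result_key": result_key}
```

Keeping the workflow function separate from the handler mirrors the separation in the text between the core Python logic and the AWS services that invoke it.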

Figure 2.

The software architecture of AutoESD.

In our platform, uploaded input files are stored in an AWS S3 bucket. When the submit button is clicked after all parameters are set, a request is sent to the API Gateway, which passes the parameters to AWS Step Functions, and all parameters are stored in AWS DynamoDB. Then, the browser receives the response, jumps to the results page, and waits for the computing results. AWS Lambda is invoked asynchronously by AWS Step Functions event sources, runs the logic code and uploads the result files to AWS S3. Each submission triggers its own parallel computing process, regardless of how much demand there is on the website, showcasing the scalability of serverless computing. To further show the robustness of our platform, we performed a large-scale simulation test. We simulated a group of in silico experiments with 100, 200, 300, 400, 500, 800, 1000 and 1500 concurrent design requests, each containing 2000 genetic manipulation targets. The results showed that the overall running time of the platform is not correlated with the number of tasks from 100 to 800 (Supplementary Figure S5). The total time starts to increase as the number of tasks exceeds 1000, because AWS sets a concurrency limit of 1000 for each user-defined Lambda function. Moreover, it is worth noting that user inputs and outputs are only visible to the users themselves, and will be stored on the AWS server for no more than one week.
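The effect of the concurrency limit on the simulation results can be reasoned about with a deliberately idealized model. The sketch below assumes that each design task occupies exactly one Lambda invocation and that tasks run in full "waves" of at most 1000. Both are simplifications introduced here rather than measured properties of the platform, and the function name `estimated_waves` is hypothetical.

```python
import math

LAMBDA_CONCURRENCY_LIMIT = 1000  # per-function limit quoted in the text


def estimated_waves(n_tasks: int, limit: int = LAMBDA_CONCURRENCY_LIMIT) -> int:
    """Number of sequential batches needed if at most `limit` tasks run at once."""
    if n_tasks < 0:
        raise ValueError("n_tasks must be non-negative")
    return math.ceil(n_tasks / limit)


task_counts = [100, 200, 300, 400, 500, 800, 1000, 1500]
print([estimated_waves(n) for n in task_counts])
# [1, 1, 1, 1, 1, 1, 1, 2]
```

Under this model every tested load up to 1000 tasks fits in a single wave, and only the 1500-task case needs a second one, which is consistent with the running time rising only once the limit is exceeded.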

Case study

B. subtilis is one of the most important industrial protein producers and a promising generally recognized as safe (GRAS) chassis for the biosynthesis of various biochemicals (29,30). Here, we show an example of how AutoESD can be used to design editing sequences for engineering B. subtilis to produce meso-2,3-butanediol (2,3-BD), a platform chemical and promising alternative biofuel (31). Six manipulations (deletion of bdhA, acoA, pta, ldh; substitution of the promoter for the precursor biosynthetic operon alsSD with a strong promoter P, and insertion of the heterologous budC gene under the control of P at the amyE locus) have been reported as effective engineering strategies for 2,3-BD production (31). To design editing sequences for these manipulations, users first need to prepare the input file including sequence data related to these genes (Figure 3A and Supplementary file 1). As mentioned before, we used sequences to locate the manipulation site. It should also be noted that sequences on both the positive and negative strands are acceptable. Still, the sequences for any single manipulation target should be provided in the same direction. Specifically, although the genes bdhA and pta are located on the antisense strand, the ‘Sequence upstream of the manipulation site (>100 bp)’ is still the upstream sequence if we follow the direction of the respective open reading frames of bdhA and pta. Similarly, the ‘Reference sequence’ is also the coding sequence for these two genes. For the substitution of the promoter sequence of the alsSD operon, which is also located on the antisense strand, the ‘Inserted sequence’ should be the P sequence following the same direction as the alsSD operon. For the heterologous budC gene to be inserted into the amyE locus on the positive strand, the ‘Inserted sequence’ is the P sequence and its coding sequence.
By including multiple sequence manipulations in one input file, AutoESD allows the users to automatically design editing sequences covering different manipulation types at arbitrary genomic loci, encompassing more than the CDS region.
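Because input sequences may be given in the direction of either strand, a design tool has to be able to find a query on both. The following Python sketch shows one simple way to do this with exact string matching; the function names are hypothetical, and the actual matching strategy used by AutoESD is not described in the text.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_on_either_strand(genome: str, query: str):
    """Locate `query` by exact match on the plus or minus strand.

    Returns (strand, start), where start is the 0-based plus-strand
    coordinate of the leftmost matched base, or None if there is no match.
    """
    genome, query = genome.upper(), query.upper()
    pos = genome.find(query)
    if pos != -1:
        return ("+", pos)
    pos = genome.find(reverse_complement(query))
    if pos != -1:
        return ("-", pos)
    return None
```

A gene on the antisense strand, such as bdhA or pta in the case study, would be reported with strand "-" when its upstream sequence is supplied in the direction of its open reading frame.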

Figure 3.

Online operation flow for running AutoESD. (A) Submission; (B) job manager; (C) design results; (D) detailed visualization.

When using AutoESD, end-users need to specify the targeted B. subtilis strain by selecting/uploading the genome sequence, the specific HR variant by uploading the linear plasmid sequence, and the screening marker sequence (Figure 3A). In this showcase, we selected one variant from two rounds of double/single crossover HR systems, which is dependent on the suicide vector pSS and counter-selectable marker upp (encoding uracil phosphoribosyl transferase) (32). For this technical variant, the endogenous gene upp is usually deleted in advance to enable its counter-selection. Accordingly, the genome of B. subtilis 168 Δupp was chosen as an example. Then, users need to sequentially select ‘Plasmid-mediated’, ‘Double’ and ‘Single’ on the website (Figure 3A) and upload a linear pSS sequence to specify this variant (Supplementary file 2) and the ‘Target Manipulations’ file (Supplementary file 1). Once all four inputs are uploaded and submitted, AutoESD will perform an initial input check and then invoke the downstream ESD workflow. The results, including all the designed primers, can be downloaded or checked online (Figure 3B). Notably, if the user prefers a rapid identification of the effect of engineering targets on 2,3-BD production, they can change the option of ‘Screening marker removal’ from the default ‘Yes’ to ‘No’ to obtain simplified design results without those needed in the second HR event (Figure 3A and Supplementary Figure S9). AutoESD provides an interactive sequence visualization interface for checking target manipulation, designed primers, and other detailed information (e.g. assembled recombinant plasmid, two rounds of verified PCR products, the location of all primers) (Figure 3C and D). To make the genetic manipulation more traceable, the genome sequence FASTA file after any single manipulation can also be downloaded.
In addition, to help avoid potential off-target events, the output file (evaluation result) lists the targets with potential off-target risk, i.e. those whose UHA/DHA can also be aligned to other regions in the target genome. The user can try to lower the risk of off-target events by altering the lengths of the UHA/DHA in the parameter settings. Moreover, to improve the reusability of primers, users can upload their own test primers that are routinely used in their labs (Figure 3A and Supplementary Figure S6). These design and quality check functions show that AutoESD provides reliable automated editing sequence design for all the manipulation types at both coding and intergenic regions. It is worth mentioning that plasmid-mediated single/single crossover HR is also a widely used technique in B. subtilis. This can be implemented using the suicide plasmid pCU, which carries upp as the counter-selectable marker and cat (chloramphenicol resistance gene) as a direct selection marker (31). For the same engineering targets for 2,3-BD biosynthesis, end-users only need to change the first crossover from ‘Double’ to ‘Single’ on the website and change the input file of the plasmid sequence to the linear pCU sequence (Supplementary file 3). The options to choose different crossover techniques allow the user to select suitable HR variants for different species, enabling the application of AutoESD across different organisms. We tested AutoESD in five species using all technical variants with thousands of in silico generated genetic manipulations covering all the types of genetic manipulation in the corresponding genomes (all high-throughput input examples are provided on the website). All the design tasks were completed within minutes (Table 1), indicating the high-throughput editing sequence design capacity of AutoESD.
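The off-target evaluation of homologous arms mentioned above can be approximated, in its simplest form, by counting how often an arm occurs in the genome on either strand. The sketch below uses exact string matching only, which is an assumption made for brevity: a real evaluation would presumably rely on sequence alignment that tolerates mismatches, and the text does not specify which method AutoESD uses. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def _count_overlapping(genome: str, query: str) -> int:
    """Count exact occurrences of `query`, allowing overlaps."""
    count, start = 0, genome.find(query)
    while start != -1:
        count += 1
        start = genome.find(query, start + 1)
    return count


def count_hits(genome: str, arm: str) -> int:
    """Count exact matches of a homologous arm on both strands."""
    genome, arm = genome.upper(), arm.upper()
    hits = _count_overlapping(genome, arm)
    rc = reverse_complement(arm)
    if rc != arm:  # avoid double-counting palindromic arms
        hits += _count_overlapping(genome, rc)
    return hits


def has_off_target_risk(genome: str, arm: str) -> bool:
    """An arm matching more than one locus may recombine at the wrong site."""
    return count_hits(genome, arm) > 1
```

An arm flagged by such a check can often be made unique by changing its length, which corresponds to the adjustment of UHA/DHA lengths suggested in the text.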

Table 1.

Running time of AutoESD for high-throughput tasks across species

Strain	Number of target manipulations in one submission	Technical variant	Total running time (s)
Bacillus subtilis 168	1000	Plasmid-mediated double/single crossover	224
Corynebacterium glutamicum ATCC 13032	1000	Plasmid-mediated single/single crossover	196
Escherichia coli BL21(DE3)	1000	Fragment-mediated double/single crossover	208
Saccharomyces cerevisiae S288c	2000	Fragment-mediated double/double crossover	424
Escherichia coli K-12 MG1655	2000	Oligonucleotide-mediated double/double crossover	387


DISCUSSION

In the synthetic biology era, the massive scale of trial-and-error experiments in engineering organisms goes far beyond the traditional labor-intensive research paradigm, motivating the development of revolutionary biofoundries (22,24). ESD automation will play an indispensable role in bridging the model-based system-level genotype design step (33) with robot-assisted high-throughput strain construction steps in an automated biofoundry (7,15,27,34). In this spirit, we developed the web tool AutoESD for the precise, automated and high-throughput design of editing sequences required for the SMHR mediated genetic manipulation of genome engineering targets. Compared with those ESD tools developed for specific genetic manipulation types in one or a few species (Supplementary Table S1), AutoESD standardized and modularized the experimental process into a computational design workflow (Figure 1B). As a result, AutoESD can easily support one-stop ESD for all the manipulation types, at any genomic loci and across species using various SMHR variants from one input file including information for all required manipulations through a user-friendly web interface (Figure 3A). Moreover, AutoESD was built based on the cloud-based serverless architecture. In this way, we can build and run applications and services without provisioning or managing servers. This architecture has also been adopted in some other studies (35,36), showing its high reliability, robustness and scalability. AutoESD can support hundreds of end-users simultaneously submitting design jobs with over 2000 genetic manipulation targets per job, and all the jobs can be processed in parallel in minutes. To the best of our knowledge, AutoESD is the first tool that supports ESD covering most genetic manipulation demands of synthetic biology and metabolic engineering. 
We believe that AutoESD combined with robotic genetic engineering platforms will pave the way for an accelerated design-build-test-learn synthetic biology cycle in future biofoundries. The SMHR system was chosen for two major reasons: (i) it is a mature and reliable genetic manipulation technique in most microorganisms, and can be rapidly applied to newly isolated non-model organisms; (ii) it can theoretically edit the targeted genome without the risk of introducing additional mutations/scars, and off-target risks can be avoided with an appropriate length of HR arms. At present, AutoESD can assist most experimental biologists with routine strain construction demands using SMHR variants suitable for their species. Moreover, AutoESD can also assist biologists trying new technical variants to further improve editing efficiency. In addition to SMHR, there are other genetic manipulation techniques, such as CRISPR/Cas-based HR (16), base editing (25) and prime editing (37), as well as other DNA assembly methods such as the type IIs endonuclease-mediated Golden Gate method (38). As a next step, we will include more technical variants and assembly methods in our design tool to help biologists achieve their functional manipulation goals.

DATA AVAILABILITY

The core logical code and sample dataset of AutoESD are available on GitHub (https://github.com/tibbdc/autoesd).

38 in total

Review 1. Design Automation in Synthetic Biology.

Authors: Evan Appleton; Curtis Madsen; Nicholas Roehner; Douglas Densmore
Journal: Cold Spring Harb Perspect Biol Date: 2017-04-03 Impact factor: 10.005

2. FastPCR: An in silico tool for fast primer and probe design and advanced sequence analysis.

Authors: Ruslan Kalendar; Bekbolat Khassenov; Yerlan Ramankulov; Olga Samuilova; Konstantin I Ivanov
Journal: Genomics Date: 2017-05-12 Impact factor: 5.736

3. j5 DNA assembly design automation software.

Authors: Nathan J Hillson; Rafael D Rosengarten; Jay D Keasling
Journal: ACS Synth Biol Date: 2011-12-20 Impact factor: 5.110

Review 4. Engineering Cellular Metabolism.

Authors: Jens Nielsen; Jay D Keasling
Journal: Cell Date: 2016-03-10 Impact factor: 41.582

Review 5. Needs and opportunities in bio-design automation: four areas for focus.

Authors: Evan Appleton; Douglas Densmore; Curtis Madsen; Nicholas Roehner
Journal: Curr Opin Chem Biol Date: 2017-09-15 Impact factor: 8.822

6. Automated Rational Strain Construction Based on High-Throughput Conjugation.

Authors: Niklas Tenhaef; Robert Stella; Julia Frunzke; Stephan Noack
Journal: ACS Synth Biol Date: 2021-02-16 Impact factor: 5.110

Review 7. Advances and prospects of Bacillus subtilis cellular factories: From rational design to industrial applications.

Authors: Yang Gu; Xianhao Xu; Yaokang Wu; Tengfei Niu; Yanfeng Liu; Jianghua Li; Guocheng Du; Long Liu
Journal: Metab Eng Date: 2018-05-21 Impact factor: 9.783