Optimal encoding rules for synthetic genes: the need for a community effort.

Gang Wu, Laura Dress, Stephen J Freeland.

Substances: Oligonucleotides

Year: 2007; PMID: 17882154; PMCID: PMC2013922; DOI: 10.1038/msb4100176

Source DB: PubMed; Journal: Mol Syst Biol; ISSN: 1744-4292; Impact factor: 11.429


A paradigm shift is underway in the methodology of heterologous protein expression. Specifically, researchers are moving away from the conventional technique of cloning genes from cDNA libraries and toward the rational design and de novo synthesis of entire protein-coding sequences from pre-annealed oligonucleotides (Libertini and Di Donato, 1992; Gustafsson ). It was the invention of the polymerase chain reaction (PCR) that allowed efficient construction of synthetic genes. Since then, the steadily increasing accuracy and decreasing cost of oligonucleotide synthesis (now as low as $0.10 per base; Carlson, 2003; Carr ; Kong , see Figure 1) have created a research environment in which gene synthesis offers three main advantages over molecular cloning: cost efficiency, scope and flexibility of redesign (Libertini and Di Donato, 1992). As a result, the emerging field of synthetic biology is highly motivated to improve this approach as it seeks to expand the sophistication of human-engineered genetic architectures, leading ultimately to the synthesis of entire genomes (Yount ; Smith ).

Figure 1

The cost, per base, of commercial oligonucleotide assembly from 1999 to 2006. The price of gene synthesis has decreased almost 30-fold in the past 7 years. Data for years 1999–2003 are taken from Carlson (2003); data for years 2004–2006 reflect the lowest price found in advertisements placed within Science magazine.

Current research into synthetic gene construction has focused largely on improving PCR-based methods. Areas under active investigation include the following: increasing the accuracy of gene products by reducing errors in oligonucleotide construction and PCR synthesis/amplification (Ciccarelli ; Young and Dong, 2004), reducing the relatively high cost of post-synthesis sequencing (Young and Dong, 2004), increasing the length of genes that can be synthesized (Kodumal ), developing microchip-based technology and/or microfluidic devices that allow for the simultaneous assembly of multiple genes (Tian ; Zhou ; Kong ), and automating the whole pipeline from gene design to synthetic gene screening (Cox ). All frontiers show signs of rapid improvement (e.g., Xiong ; Engels, 2005; Wu ); the current challenges for gene synthesis are therefore essentially optimizations of existing concepts. In stark contrast, it appears that we have much left to learn when it comes to the conceptual design of gene sequences. A significant fraction of the studies that have redesigned biologically and commercially important genes report little or no success in increasing protein expression (e.g., see Alexeyev and Winkler, 1999; Flick ; Wu ; Hillier ). More surprisingly, some of these ‘improvements’ have led to a direct and observable reduction in protein production (Griswold ). Even those studies that do report increased protein yield require careful scrutiny, because many have not controlled for altered mRNA levels in their system (e.g., Deng, 1997; Alexeyev and Winkler, 1999; Feng ; Humphreys ; Nalezkova ). Thus, although excellent progress in the practice of gene synthesis enables experimental implementation of the technique, the scientific community remains far from a complete understanding of what constitutes a rational design strategy for a protein-coding gene.

Instead, the very concept of a ‘translationally optimal codon’ has grown to incorporate dimensions of translational speed, translational accuracy and sustainability of yield that could vary from one experiment to another. Meanwhile, we have learned that a codon's position within a coding sequence, its ‘neighborhood’ of other codons, its structural role within the mRNA sequence and the nature of the genomic system in which it is to be expressed can all influence the effects of ‘synonymous’ codon choices. Given that we can physically construct any gene, what rules define the appropriate sequence to manufacture? Here, we examine current progress and emerging challenges in both theory and practice, showing how this topic exemplifies the interdisciplinary challenges of 21st-century biology.

Why redesign the coding sequence?

Modern expression vectors have undergone extensive manipulation to maximize mRNA transcription. Yet a relatively weak correlation can exist between expression levels of mRNA and those of translated protein products (e.g., Futcher ; Nie ). Thus, it is now widely understood that persistently poor expression of a protein product can result from problems occurring at a post-transcriptional stage, especially at the point of translation (Kurland and Gallant, 1996; Gustafsson ). The issue here is that the ‘digital’ portrayal of translation found in biology textbooks oversimplifies a bio-mechanical process in which different populations of tRNAs essentially compete to translate an appropriate codon of mRNA within the context of a ribosome (e.g., Rodnina ). Different organisms can vary enough in their relative contents of isoaccepting tRNAs to change the dynamics of this competition, such that different choices from a suite of synonymous codons can influence the speed and accuracy of translation. For this reason, it can be useful to redesign a protein-coding sequence to suit its new context when moving it between genomes.

What should we build? The theory of synthetic gene design

The most direct method to find an optimal encoding for heterologous expression would be to comprehensively screen all possible alternative sequences. This is, however, impractical for sequences of any appreciable length because of the near-infinite encoding possibilities: approximately 3.7 × 10^21 different nucleic acid sequences could encode a single peptide comprising 150 amino acids. Top-down screening procedures must therefore be guided by bottom-up gene design. To this end, a wealth of software has been developed to help bench scientists achieve reverse translation (Arentzen and Ripka, 1984; Mount and Conrad, 1984; Danckaert ; Pesole ; Presnell and Benner, 1988; Weiner and Scheraga, 1989; Bains, 1990; Tamura ; Libertini and Di Donato, 1992; Makarova ; Nash, 1993; Raghava and Sahni, 1994; Withers-Martinez ; Hoover and Lubkowski, 2002; Fuglsang, 2003; Gao ; Grote ; Jayaraj ; Richardson ; Villalobos ; Wu ; Puigbo ). Broadly speaking, this software can be divided into two categories according to algorithmic purpose: one seeking gene designs that facilitate empirical sequence manipulations, the other seeking designs that translate well into protein products. Perhaps the two most salient features of this software are the diversity of opinion as to what rules will optimize translation and a general failure of each software solution to acknowledge that numerous competitors exist (Figure 2).
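
The scale of this search space follows from the degeneracy of the genetic code: the number of nucleotide sequences that encode a peptide is the product, over its residues, of the number of synonymous codons available for each residue. The following is a minimal Python sketch, assuming the standard genetic code and ignoring the stop codon; the function name count_encodings is illustrative and is not taken from any of the software cited above.

```python
from math import prod

# Number of synonymous codons per amino acid in the standard genetic code
# (one-letter amino acid codes; the stop codon is deliberately left out).
DEGENERACY = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "F": 2, "Y": 2, "H": 2, "Q": 2, "N": 2,
    "K": 2, "D": 2, "E": 2, "C": 2,
    "M": 1, "W": 1,
}


def count_encodings(peptide: str) -> int:
    """Return how many distinct coding sequences encode the given peptide."""
    return prod(DEGENERACY[aa] for aa in peptide.upper())


if __name__ == "__main__":
    # Met has 1 codon, Leu has 6, Ser has 6, so 1 * 6 * 6 = 36 encodings.
    print(count_encodings("MLS"))  # prints 36
```

Because the count depends on amino acid composition as well as length, two peptides of equal length can differ by many orders of magnitude in their number of possible encodings.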

Figure 2

The gene design software shown as a network of citations by date of publication. Each of the 37 nodes represents a specific software application for gene design; arrows indicate acknowledgements (citations) of pre-existing, published software. Appropriate web development would alleviate this patchy awareness of competing efforts, and could eliminate inefficient and needless duplication of effort. Abbreviations: BBOCUS: BackTranslation Based On Codon Usage Strategy; BT: BACKTR (Pesole ); bt: backtrans (Mount and Conrad, 1984); CO: Codon Optimizer (Fuglsang, 2003); CODOP: CODon OPtimization (Withers-Martinez ); DB: DNA Builder (Pacific Northwest National Laboratory); DIROM (Makarova ); DW: DNAWorks (Hoover and Lubkowski, 2002); EB: EasyBack (University of Catania, Italy); G.D: Gene.Design (Weiner and Scheraga, 1989); G_D: Gene Designer (Villalobos ); G_Dn: GeneDn (Ju ); GC: Gene Composer (by Emerald Biosystems); GD: GeneDesign (Richardson ); ged: gene design (Presnell and Benner, 1988); GeMS: Gene Morphing System (Jayaraj ); GF: The Gene Forge (by AptaGen LLC); GMAP (Raghava and Sahni, 1994); GO: GeneOptimizer (by GeneArt, Germany); IBG: IBG GeneDesigner (Vogelbacher ); JCat: Java Codon Adaptation Tool (Grote ); LBT: Locally Sensitive BackTranslation; Leto (by Entelechon Inc.); OG: OptGene (by Ocimum Biosolutions); P2D: Protein2DNA (by DNA 2.0 Inc); PGen: PrimerGen (Nash, 1993); PINCERS (Tamura ); PO: Primo Optimum (by Chang Bioscience); RESTRI (Libertini and Di Donato, 1992); RT: Reverse Translate (Danckaert ); SGD: Synthetic Gene Designer (Wu , 2006b); SMS: Sequence Manipulation Suite (Stothard, 2000); THOYO (Bains, 1990); TIP: Traducción Inversa de Proteínas/Protein Backtranslation (Moreira and Maass, 2004); U1: unnamed1 (Arentzen and Ripka, 1984); U2: unnamed2 (Danckaert ); UpG: UpGene (Gao ). Applications shown with dashed lines are software that has never appeared in the peer-reviewed scientific literature.

So where should we seek guidance as to the rules of optimal encoding? Until now, the overwhelming majority of synthetic genes reported in the peer-reviewed literature represent unique attempts to re-engineer different genes. As a result, their collection into a single database (Wu ) currently presents a ‘broad and shallow’ scatter of isolated points in sequence space. A shift in research emphasis is needed to refocus efforts toward a ‘narrow and deep’ systematic comparison of different recoding strategies for a few genes. Meanwhile, the nearest we have to such a dataset are the numerous gene variants produced by evolution. It has long been recognized that codon usage frequency appears to be unequal for most synonymous codons within naturally occurring genomes (Grantham ). Much of this bias is a passive reflection of the mutation biases at work in a genome (Sharp ; Knight ); however, it can be difficult to ascertain which features of which sequences have been shaped by natural selection. Not only do precise predictions from evolutionary theory rely on parameters that we may never know with certainty, but the noise-to-signal ratio implicit within any ‘naturally optimized’ sequence can confound the most careful analyses.
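
One common way to quantify such unequal usage is relative synonymous codon usage (RSCU): the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below is an illustration under stated assumptions (the standard genetic code, an in-frame coding sequence, stop codons ignored). The function name rscu and the example sequence are hypothetical and are not drawn from the studies cited above.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering
# (third base varies fastest); '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(cds: str) -> dict:
    """Relative synonymous codon usage for each codon of each amino acid
    that occurs in the coding sequence `cds` (assumed to be in frame)."""
    cds = cds.upper()
    # Split into codons, dropping any incomplete trailing codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families, one family per amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    result = {}
    for aa, family in families.items():
        total = sum(counts[c] for c in family)
        if aa == "*" or total == 0:
            continue  # skip stop codons and amino acids that never occur
        for c in family:
            # Observed count divided by the mean count within the family.
            result[c] = counts[c] * len(family) / total
    return result


if __name__ == "__main__":
    # Made-up sequence: four leucine codons, three CTG and one CTA.
    usage = rscu("CTGCTGCTGCTA")
    print(usage["CTG"], usage["CTA"], usage["TTA"])  # prints 4.5 1.5 0.0
```

An RSCU value of 1 indicates no bias within a synonymous family; as the text notes, a value far from 1 does not by itself distinguish mutation bias from selection.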

Where to next? Specific objectives for future progress

Although the major unknowns of synthetic gene technology are mostly those of design theory, the current problem is an excess, not a deficit, of ideas. Major progress thus seems poised to occur when empirical studies start to compare these ideas systematically. An important step would be to standardize experimental protocols and reports so that the emerging patchwork of results can be examined as a coherent whole. Specifically, if experiments are to be compared, they must standardize their measurement of mRNA expression levels for the target genes (as a baseline for interpreting protein yields) and measure protein production in absolute rather than relative terms (e.g., mg/l or percentage of total protein rather than ‘n-fold increase/decrease’). A further step would be to identify one or more standardized (model) experimental systems for use by any and all research groups that are willing to share information. An ideal expression system would not be pre-engineered in any way that could confound interpretation of results (e.g., by containing enriched tRNA pools). It would employ a protein product that is amenable to clear, quantitative assay, and it could include an internal control (such as a dual reporter system in which only one gene has been redesigned) to add further confidence to measurements of protein yields. The idea of standardization extends into the philosophy of bioinformatics software that predicts gene design. Current software typically requires a combination of logically independent gene-optimizing steps as a mandatory, pre-packaged whole. This renders the comparison of results difficult and suggests the need for secondary algorithms designed to isolate specific gene features (e.g., changing codons while maintaining overall GC content, or varying GC content while maintaining RNA structural motifs).
It is noteworthy that the underlying nature of all gene design software is similar and simple: a user must input a protein sequence and a genetic code. The protein sequence is then reverse translated into a nucleotide sequence using one or more algorithms, and the resulting nucleotide sequence is returned to the user. Independent applications must duplicate at least this much functionality. A promising direction of future software development in this field would be an emphasis on integration into a unified, distributed, modular web service for synthetic gene design. Specifically, programmers could take advantage of purpose-built web technologies, such as XML (a markup language for sharing data) and SOAP (a messaging protocol for wrapping independent applications), to facilitate interconnection of disparate, pre-existing software. New algorithms could be added as pathways through which a synthetic gene might travel en route to final design. This would provide users with a common interface through which they could choose the specific algorithm(s) to use at each step of synthetic gene design. Far from restricting the diversity of independent ideas for design services offered by different groups (on different web servers), this type of coordination through a common interface would focus attention where it belongs: on the overlapping (and sometimes directly competing) concepts of how to design genes for optimal expression.
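
The shared core described above (protein sequence and genetic code in, nucleotide sequence out, with the choice of algorithm left open) can be sketched as a single function that accepts interchangeable codon-selection strategies. This is a hypothetical illustration in Python, not the interface of any cited tool: the names reverse_translate, first_codon and make_most_frequent, the three-residue TOY_CODE and the usage frequencies in TOY_USAGE are all invented for the example.

```python
from typing import Callable, Dict, List

# A genetic code maps each amino acid to its synonymous codons.
GeneticCode = Dict[str, List[str]]
# A strategy picks one codon for an amino acid from its synonymous set.
Strategy = Callable[[str, List[str]], str]

# Illustrative subset of the standard genetic code (not a complete table).
TOY_CODE: GeneticCode = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "L": ["CTG", "TTA"],
}

# Made-up codon frequencies for a hypothetical host organism.
TOY_USAGE: Dict[str, float] = {
    "ATG": 1.0, "AAA": 0.3, "AAG": 0.7, "CTG": 0.4, "TTA": 0.6,
}


def first_codon(aa: str, codons: List[str]) -> str:
    """Trivial strategy: always take the first listed codon."""
    return codons[0]


def make_most_frequent(usage: Dict[str, float]) -> Strategy:
    """Build a strategy that picks the codon with the highest usage value."""
    def choose(aa: str, codons: List[str]) -> str:
        return max(codons, key=lambda codon: usage.get(codon, 0.0))
    return choose


def reverse_translate(protein: str, code: GeneticCode, strategy: Strategy) -> str:
    """Reverse translate a protein using the supplied code and strategy."""
    return "".join(strategy(aa, code[aa]) for aa in protein.upper())


if __name__ == "__main__":
    print(reverse_translate("MKL", TOY_CODE, first_codon))
    # prints ATGAAACTG
    print(reverse_translate("MKL", TOY_CODE, make_most_frequent(TOY_USAGE)))
    # prints ATGAAGTTA
```

Keeping the strategy as a separate argument means that two design rules can be compared on the same protein and the same genetic code without duplicating the surrounding functionality.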

Critical assessment

Our suggested shift in research emphasis toward standardized protocols and integration of competing design strategies would create a foundation with potential that exceeds the capabilities of any one group or traditional collaboration. How then can the diverse interests of those working on synthetic gene design be harnessed into a common framework for progress? We advocate the introduction of a competitive model, similar to the CASP approach that has been used within the protein folding research community (Moult, 2005). Given a standardized experimental protocol, it would be possible to pick genes of major research interest that are proving problematic for heterologous expression. For example, a recent study of Plasmodium falciparum, the causative agent of the most deadly form of malaria, reported that ‘12 targets, which did not express in Escherichia coli from the native gene sequence were codon-optimized through whole gene synthesis, resulting in the expression of three of these proteins’ (Mehlin ). Presumably, malaria researchers would be motivated to call for theoretical predictions of redesign that could help their situation. Theorists and software developers should in turn be motivated to demonstrate their algorithms' worth as the marketplace of redesign ideas becomes increasingly saturated, and those who research the optimization of gene assembly protocols (regardless of sequence content) would be motivated to absorb a significant fraction of the effort required for synthesizing these predictions. The net result would be a distributed (community-wide) version of the direct screening approach favored by early pioneers of synthetic gene technology (Stemmer ; Humphreys ), in which each segment of the community directly benefits from a united focus. If all designs were deposited within the SGDB (Synthetic Gene Database) (Wu ), then this could quickly transform the knowledge base for synthetic gene technology.
Fortunately, recent advances in multiplex gene synthesis technology suggest that the simultaneous synthesis of thousands of genes for large-scale experimental tests is feasible (Tian ; Zhou ; Cox ; Kong ), so the potential for large-scale comparison of predictions may be nearer than we think. This is an ambitious vision, but the motivation is strong. Current synthetic gene technology offers the potential to become a foundational tool of systems biology. However, until we know how to optimize coding sequences, we cannot construct a single synthetic gene with confidence, let alone produce a whole synthetic genome.

60 in total

1. Codon optimizer: a freeware tool for codon optimization.

Authors: Anders Fuglsang
Journal: Protein Expr Purif Date: 2003-10 Impact factor: 1.650

Review 2. Codon bias and heterologous protein expression.

Authors: Claes Gustafsson; Sridhar Govindarajan; Jeremy Minshull
Journal: Trends Biotechnol Date: 2004-07 Impact factor: 19.536

3. A simple, rapid, high-fidelity and cost-effective PCR-based two-step DNA synthesis method for long gene sequences.

Authors: Ai-Sheng Xiong; Quan-Hong Yao; Ri-He Peng; Xian Li; Hui-Qin Fan; Zong-Ming Cheng; Yi Li
Journal: Nucleic Acids Res Date: 2004-07-07 Impact factor: 16.971

4. Two-step total gene synthesis method.

Authors: Lei Young; Qihan Dong
Journal: Nucleic Acids Res Date: 2004-04-15 Impact factor: 16.971

5. The pace and proliferation of biological technologies.

Authors: Robert Carlson
Journal: Biosecur Bioterror Date: 2003

6. Generating a synthetic genome by whole genome assembly: phiX174 bacteriophage from synthetic oligonucleotides.

Authors: Hamilton O Smith; Clyde A Hutchison; Cynthia Pfannkoch; J Craig Venter
Journal: Proc Natl Acad Sci U S A Date: 2003-12-02 Impact factor: 11.205

7. Codon optimization reveals critical factors for high level expression of two rare codon genes in Escherichia coli: RNA stability and secondary structure but not tRNA abundance.

Authors: Xiaoqiu Wu; Hans Jörnvall; Kurt D Berndt; Udo Oppermann
Journal: Biochem Biophys Res Commun Date: 2004-01-02 Impact factor: 3.575

8. Microfluidic PicoArray synthesis of oligodeoxynucleotides and simultaneous assembling of multiple DNA sequences.

Authors: Xiaochuan Zhou; Shiying Cai; Ailing Hong; Qimin You; Peilin Yu; Nijing Sheng; Onnop Srivannavit; Seema Muranjan; Jean Marie Rouillard; Yongmei Xia; Xiaolin Zhang; Qin Xiang; Renuka Ganesh; Qi Zhu; Anna Matejko; Erdogan Gulari; Xiaolian Gao
Journal: Nucleic Acids Res Date: 2004-10-11 Impact factor: 16.971

9. UpGene: Application of a web-based DNA codon optimization algorithm.

Authors: Wentao Gao; Alexis Rzewski; Huijie Sun; Paul D Robbins; Andrea Gambotto
Journal: Biotechnol Prog Date: 2004 Mar-Apr

10. OPTIMIZER: a web server for optimizing the codon usage of DNA sequences.

Authors: Pere Puigbò; Eduard Guzmán; Antoni Romeu; Santiago Garcia-Vallvé
Journal: Nucleic Acids Res Date: 2007-04-16 Impact factor: 16.971
