
ProteinCCD: enabling the design of protein truncation constructs for expression and crystallization experiments.

Wijnand T M Mooij, Eirini Mitsiki, Anastassis Perrakis.

Abstract

ProteinCCD (CCD for Crystallographic Construct Design) aims to facilitate a common practice in structural biology, namely the design of several truncation constructs of the protein under investigation, based on experimental data or on sequence analysis tools. ProteinCCD functions as a meta-server, available online at http://xtal.nki.nl/ccd, that collects information from prediction servers concerning secondary structure, disorder, coiled coils, transmembrane segments, domains and domain linkers. It then displays a condensed view of all results against the protein sequence. The user can study the output and choose interactively possible starts and ends for suitable protein constructs. Since the required input to ProteinCCD is the DNA and not the protein sequence, once the starts and ends of constructs are chosen, the software can automatically design the oligonucleotides needed for PCR amplification of all constructs. ProteinCCD outputs a comprehensive view of all constructs and all oligos needed for bookkeeping or for direct copy-paste ordering of the designed oligonucleotides.

Substances:
Oligonucleotides
Proteins

Year: 2009 PMID: 19395596 PMCID: PMC2703965 DOI: 10.1093/nar/gkp256

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The production of soluble proteins in amounts suitable for structural studies has been a common bottleneck in structural biology and structural genomics alike (1,2). For X-ray crystallographic studies an additional goal is to obtain not only soluble protein, but also a specific construct with a high propensity to crystallize. Similarly, for NMR studies the protein domain needs to be not only relatively small, but also soluble at relatively high concentration to deliver clear spectra. The advent of cloning techniques that are high-throughput, inexpensive and compatible with robotic implementations (3,4) allows parallel construction of tens of expression constructs for each protein under study, a standard practice in many labs.

Expression constructs can be designed based on experimental information, typically limited proteolysis experiments followed by mass spectrometry-based identification of the proteolytic fragments (4). Computational design based on sequence analysis is another method of choice. The use of multiple sequence alignments is widespread and there are a variety of specific tools, e.g. T-Coffee (5) or MUSCLE (6). Based on similarities and differences among family members, the researcher decides which domain boundaries are likely to yield soluble, well-behaved proteins. Comparative modeling (7) can also be used if a structure of a homologous protein is known. Finally, a variety of sequence analysis methods aim to deliver structural information from the sequence alone. Although significant progress has been made in sequence analysis, there is no definitive method of choice. Typically, most researchers use a ‘personal’ collection of web-based tools for sequence analysis, based on prior experience. A clear bottleneck arises, as it is cumbersome to display the results of different tools in a condensed form. At present, this requires submitting many queries to different servers, and subsequent copying and pasting to compare the results of different methods.

After a concise and condensed representation of all analysis results is obtained, the researcher typically decides which domain boundaries are promising for the protein in hand. The next step is to design oligonucleotides to be used for PCR-based amplification of all these fragments. At this stage an additional bottleneck is encountered: the protein-based analysis has to be transformed back to the DNA sequence. Although the task is conceptually trivial, it is time-consuming and error-prone, since the direct mapping between protein and DNA sequence is lost in the analysis step.

METHODS

We have developed a web-based tool, ProteinCCD, which provides a simple and practical solution for daily use in the research laboratory. ProteinCCD addresses these bottlenecks by (i) acting as a meta-server that combines several sequence analysis methods that are available as web services and commonly used for expression construct design, and displaying all results in a condensed and concise manner; and (ii) requiring as user input the DNA rather than the protein sequence, thus enabling the 'single-click' design of all oligonucleotides needed for PCR amplification of the user-designed constructs.

We chose methods from four groups of sequence analysis tools. The first group concerns secondary structure prediction servers, which aim to predict stretches of sequence that are likely to be either helices or strands in the three-dimensional structure (8) and should not be disrupted. Different algorithms give slightly different results, especially at the boundaries of secondary structure elements. We have therefore included a few methods from this group, and use the collection available at the Network Protein Sequence @nalysis (9) (NPS@). The second group of sequence analysis methods concerns algorithms that aim to predict disordered regions in protein sequences. We chose four methods: IUPred (10), which uses the estimated pairwise energy content; RONN (11), which is based on a Bio-Basis Function Neural Network (BBFNN) to predict intrinsically disordered regions in proteins; DisEMBL (12), which uses empirical definitions for disorder predictions; and GlobPlot (13), which uses predictions for globularity against disorder to better identify disordered regions in proteins. The third group of methods involves specialized servers for specific features of protein sequence. Currently we check for coiled-coil regions (14), and for transmembrane topology combined with signal peptide prediction, as available from the Phobius web server (15). The fourth and final group of methods includes two of the many new algorithms that look for domains. The Simple Modular Architecture Research Tool (16) (SMART) aims to identify and annotate genetically mobile domains and to analyze domain architecture. The Domain Linker Predictor (17), in contrast, attempts to flag the regions between likely domains.

A user needs to submit a cDNA sequence to ProteinCCD. This entry is first checked for validity, translated to the amino-acid sequence, displayed in the output window and sequentially submitted to the selected servers. As results are returned in real time, they are displayed below the query sequence in the manner of a multiple alignment, allowing direct comparison of results between different methods. The user can use this condensed comparative output to scroll along the sequence and choose domain boundaries, with simple mouse clicks for choosing N- and C-terminal boundaries. This step is entirely up to the user and no automated method is provided.

As soon as the user has selected domain boundaries, ProteinCCD outputs a list of the resulting protein sequences and all the oligonucleotides needed for the PCR amplification of these sequences. Oligonucleotides are chosen based either on the simple rule that N nucleotides are needed for annealing (default N = 20) or by choosing enough nucleotides to reach a user-defined annealing temperature Tm (default Tm = 65), based on a formula in which GCnr is the number of guanine and cytosine nucleotides in the oligonucleotide sequence. User-defined overhangs are appended to both the 5′ and the 3′ oligonucleotide, to facilitate cloning and quick cut-and-paste ordering of the final oligonucleotides. ProteinCCD does not check oligonucleotides for secondary structure or for false-annealing sites.
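The input check, translation and oligonucleotide selection described above can be sketched in a few lines of Python. This is a minimal illustration and not the ProteinCCD implementation: all function names are hypothetical, the melting temperature uses a widely quoted GC-content approximation, Tm = 64.9 + 41 × (GCnr − 16.4)/N, which is assumed here rather than taken from the text, and overhangs are assumed to be placed at the 5′ end of each primer.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def validate_dna(dna):
    """Strip whitespace, upper-case, and reject non-ACGT or out-of-frame input."""
    dna = "".join(dna.split()).upper()
    if not dna or set(dna) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G and T")
    if len(dna) % 3:
        raise ValueError("sequence length must be a multiple of three")
    return dna


def translate(dna):
    """Translate an in-frame DNA string codon by codon ('*' marks a stop)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def melting_temp(oligo):
    """Assumed GC-content approximation; the source formula may differ."""
    gc = oligo.count("G") + oligo.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(oligo)


def annealing_region(template, n=20, tm=None, min_len=12, max_len=60):
    """Take n nucleotides, or the shortest prefix whose Tm reaches `tm`."""
    if tm is None:
        return template[:n]
    limit = min(max_len, len(template))
    for length in range(min_len, limit + 1):
        if melting_temp(template[:length]) >= tm:
            return template[:length]
    return template[:limit]  # target Tm not reached within the allowed length


def design_primers(dna, start_res, end_res,
                   fwd_overhang="", rev_overhang="", n=20, tm=None):
    """Forward and reverse primers for residues start_res..end_res (1-based)."""
    dna = validate_dna(dna)
    # Residue r occupies nucleotides 3*(r-1) .. 3*r - 1 (0-based).
    insert = dna[3 * (start_res - 1):3 * end_res]
    fwd = fwd_overhang + annealing_region(insert, n, tm)
    rev = rev_overhang + annealing_region(reverse_complement(insert), n, tm)
    return fwd, rev
```

Because the residue numbers are converted to nucleotide offsets inside `design_primers`, the mapping between protein and DNA coordinates never has to be done by hand, which is the error-prone step the tool is meant to remove. Like the server as described, this sketch does not check the primers for secondary structure or false-annealing sites.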

EXAMPLE

We will now consider a very simple example describing the steps to analyze a new sequence and design five simple truncation constructs. For the example we will use the protein Dug2 from yeast (18). Figure 1 shows an overview of the ProteinCCD web server after the submitted job had finished. The only input was the DNA sequence of Dug2. After pasting the sequence into the top field and pressing the ‘Submit’ button, the servers selected by default returned the analysis results for the protein sequence. The translated protein sequence and the ‘aligned’ predictions appear in the panel below the input. Subsequently, we selected with the mouse three N-termini (residues 1, 214 and 510) and three C-termini (residues 106, 458 and the terminal residue 878). In Figure 2A, we display the predictions in the area of all six termini. In brief, SMART (16) shows that Dug2 contains four WD40 repeats, spanning the N-terminal half, and a peptidase domain spanning the C-terminal half. Here we discuss how we made the exact choices for the N- and C-termini of all truncation constructs.

Figure 1.

An overview of the ProteinCCD server after all predictions have been collected, user choices have been made, and the oligonucleotides have been suggested, for the example discussed in this paper.

Figure 2.

(A) Regions of interest from the ProteinCCD output where starts (green letters) and stops (red letters) are marked by the user. Discontinuities in the sequence have been marked with (...). (B) The oligonucleotide sequences for PCR amplifications of the regions of interest as suggested by ProteinCCD. (C) An ethidium bromide stained agarose gel showing the PCR products for the five selected truncation constructs of interest obtained with the designed oligonucleotides under standard experimental conditions. (D) A polyacrylamide SDS gel stained with Coomassie blue showing the expressed and IMAC purified truncation constructs designed and cloned above. The (*) denotes the protein of interest in each lane.

Nterm-1: The first WD40 repeat starts essentially at the very start of the protein, so the natural N-terminus was considered a good choice for an expression construct.

Cterm-106: The first pair of WD40 repeats ends at residue 98. Since a consensus strand prediction runs to residue 99, we chose to include a few additional residues to be sure that the interactions of the C-terminal predicted strand are maintained.

Nterm-214: This residue is right at the start of the helix of the third predicted WD40 domain.

Cterm-458: The fourth WD40 domain predicted by SMART ends at residue 396. However, the three secondary structure predictions contradict each other: they indicate a short strand followed by a helix, a long but weak strand, and a short but strong strand, respectively. In addition, this residue is within a region predicted by RONN (11) to be disordered and, contradicting that prediction, just before a region predicted to be a domain linker (17). Finally, we noticed that two of the secondary structure prediction programs predict two helices in the next ∼50 residues. We therefore decided in this case to use residue 458 as the C-terminus and include these two secondary structure elements.

Nterm-510: The peptidase SMART domain is predicted to start at residue 516. However, a consensus strand is predicted to start at residue 513. To be sure we do not interrupt secondary structure elements that might be important for domain folding, we chose to start that construct at residue 510.

Cterm-878: The peptidase SMART domain ends right at the natural C-terminus, so choosing exactly that residue was straightforward.

After these ‘starts’ and ‘stops’ are selected with the corresponding buttons in the web server application, the ‘Submit’ button is pressed. This automatically calculates the oligonucleotides needed to make all possible constructs between these termini. In this particular case we only wanted constructs Dug2(1–878), Dug2(1–458), Dug2(214–878), Dug2(510–878) and Dug2(1–106). The oligonucleotides that were designed to amplify these constructs and clone them into the NKI-LIC-His-3C vector are presented in Figure 2B. In Figure 2C we show the DNA amplified under standard PCR conditions, and in Figure 2D the expression experiments, which resulted in all constructs being soluble. In conclusion, in this case study, educated ‘guesses’ for construct design were made very easy by the comprehensive view offered by ProteinCCD, and the experiments designed with the aid of the web server resulted in soluble proteins suitable for structural and functional studies.
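To illustrate how "all possible constructs between these termini" follow from the selected starts and stops, the short Python sketch below pairs every chosen N-terminus with every chosen C-terminus and converts each pair to a nucleotide range. The function names are hypothetical, and the rule that a start must precede its stop is an assumption about how pairs are filtered, not a documented detail of the server.

```python
from itertools import product


def enumerate_constructs(starts, stops):
    """All (start, stop) residue pairs in which the start precedes the stop."""
    return [(s, e) for s, e in product(sorted(starts), sorted(stops)) if s < e]


def residue_span_to_dna(start_res, end_res):
    """1-based inclusive residue range -> 0-based half-open nucleotide slice.

    Assumes the coding sequence starts at nucleotide 0; no stop codon is added.
    """
    return 3 * (start_res - 1), 3 * end_res


# Termini selected in the Dug2 example.
constructs = enumerate_constructs([1, 214, 510], [106, 458, 878])

for start, stop in constructs:
    lo, hi = residue_span_to_dna(start, stop)
    print(f"Dug2({start}-{stop}): {stop - start + 1} residues, DNA slice [{lo}:{hi}]")
```

With the termini of the example, six start/stop pairs satisfy this rule: the five constructs selected in the text plus Dug2(214–458), which was not pursued.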

CONCLUSION AND PERSPECTIVE

ProteinCCD provides a user-friendly tool to facilitate protein construct design and oligonucleotide ordering. By consolidating tools that are common in structural biology in a single platform, ProteinCCD enables comparative analysis of the sequence, and by keeping track of both the protein and DNA sequence it allows the straightforward design of oligonucleotides for PCR amplification of the protein constructs. The choice of protein constructs for structural studies is not a straightforward task. Although some notable automation attempts exist, there is no widely accepted method. Thus, at this stage we chose to implement only a tool that collects the results of a variety of popular web servers. This enables informed decisions by the user, but at present no further automation is provided, and we are unable to quantify or benchmark the server, since its output clearly depends on user choices. It provides an enabling technology and not an automated tool. Among the plans for future work, however, is to allow users to submit their construct choices to a connected database. This could allow for the future development of supervised learning methods to imitate user choices. This would still not allow the development of algorithms that actually predict which constructs end up being soluble (or even crystallizable), since that information is not readily available and is very difficult to collect outside the context of well-managed high-throughput projects. It can be argued, however, that a learning system that mimics common user choices for a variety of targets could indeed be a useful direction for future research.

FUNDING

ProteinCCD has been both inspired and tested as part of our participation in the EU FP6 programs 3D-Repertoire (LSHG-CT-2005-512028) and SPINE2-Complexes (LSH-2004-1.1.2-1). Funding for open access charge: Netherlands Cancer Institute. Conflict of interest statement. None declared.

18 in total

Review 1. NPS@: network protein sequence analysis.

Authors: C Combet; C Blanchet; C Geourjon; G Deléage
Journal: Trends Biochem Sci Date: 2000-03 Impact factor: 13.807

Review 2. High-throughput proteomics: protein expression and purification in the postgenomic world.

Authors: S A Lesley
Journal: Protein Expr Purif Date: 2001-07 Impact factor: 1.650

3. GlobPlot: Exploring protein sequences for globularity and disorder.

Authors: Rune Linding; Robert B Russell; Victor Neduva; Toby J Gibson
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

4. Characterization and prediction of linker sequences of multi-domain proteins by a neural network.

Authors: Satoshi Miyazaki; Yutaka Kuroda; Shigeyuki Yokoyama
Journal: J Struct Funct Genomics Date: 2002

5. MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors: Robert C Edgar
Journal: Nucleic Acids Res Date: 2004-03-19 Impact factor: 16.971

6. RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins.

Authors: Zheng Rong Yang; Rebecca Thomson; Philip McNeil; Robert M Esnouf
Journal: Bioinformatics Date: 2005-06-09 Impact factor: 6.937

Review 7. Protein production and purification.

Authors: Susanne Gräslund; Pär Nordlund; Johan Weigelt; B Martin Hallberg; James Bray; Opher Gileadi; Stefan Knapp; Udo Oppermann; Cheryl Arrowsmith; Raymond Hui; Jinrong Ming; Sirano dhe-Paganon; Hee-won Park; Alexei Savchenko; Adelinda Yee; Aled Edwards; Renaud Vincentelli; Christian Cambillau; Rosalind Kim; Sung-Hou Kim; Zihe Rao; Yunyu Shi; Thomas C Terwilliger; Chang-Yub Kim; Li-Wei Hung; Geoffrey S Waldo; Yoav Peleg; Shira Albeck; Tamar Unger; Orly Dym; Jaime Prilusky; Joel L Sussman; Ray C Stevens; Scott A Lesley; Ian A Wilson; Andrzej Joachimiak; Frank Collart; Irina Dementieva; Mark I Donnelly; William H Eschenfeldt; Youngchang Kim; Lucy Stols; Ruying Wu; Min Zhou; Stephen K Burley; J Spencer Emtage; J Michael Sauder; Devon Thompson; Kevin Bain; John Luz; Tarun Gheyi; Fred Zhang; Shane Atwell; Steven C Almo; Jeffrey B Bonanno; Andras Fiser; Sivasubramanian Swaminathan; F William Studier; Mark R Chance; Andrej Sali; Thomas B Acton; Rong Xiao; Li Zhao; Li Chung Ma; John F Hunt; Liang Tong; Kellie Cunningham; Masayori Inouye; Stephen Anderson; Heleema Janjua; Ritu Shastry; Chi Kent Ho; Dongyan Wang; Huang Wang; Mei Jiang; Gaetano T Montelione; David I Stuart; Raymond J Owens; Susan Daenke; Anja Schütz; Udo Heinemann; Shigeyuki Yokoyama; Konrad Büssow; Kristin C Gunsalus
Journal: Nat Methods Date: 2008-02 Impact factor: 28.547

Review 8. Predicting coiled-coil regions in proteins.

Authors: A Lupas
Journal: Curr Opin Struct Biol Date: 1997-06 Impact factor: 6.809

9. The sequence of a 32,420 bp segment located on the right arm of chromosome II from Saccharomyces cerevisiae.

Authors: K Holmstrøm; T Brandt; T Kallesøe
Journal: Yeast Date: 1994-04 Impact factor: 3.239

10. Protein disorder prediction: implications for structural proteomics.

Authors: Rune Linding; Lars Juhl Jensen; Francesca Diella; Peer Bork; Toby J Gibson; Robert B Russell
Journal: Structure Date: 2003-11 Impact factor: 5.006
