
Domain selection combined with improved cloning strategy for high throughput expression of higher eukaryotic proteins.

Yunjia Chen, Shihong Qiu, Chi-Hao Luan, Ming Luo.

Abstract

BACKGROUND: Expression of higher eukaryotic genes as soluble, stable recombinant proteins is still a bottleneck step in biochemical and structural studies of novel proteins today. Correct identification of stable domains/fragments within the open reading frame (ORF), combined with proper cloning strategies, can greatly enhance the success rate when higher eukaryotic proteins are expressed as these domains/fragments. Furthermore, a HTP cloning pipeline incorporated with bioinformatics domain/fragment selection methods will be beneficial to studies of structure and function genomics/proteomics.
RESULTS: With bioinformatics tools, we developed a domain/domain boundary prediction (DDBP) method, which was trained by available experimental data. Combined with an improved cloning strategy, DDBP had been applied to 57 proteins from C. elegans. Expression and purification results showed there was a 10-fold increase in terms of obtaining purified proteins. Based on the DDBP method, the improved GATEWAY cloning strategy and a robotic platform, we constructed a high throughput (HTP) cloning pipeline, including PCR primer design, PCR, BP reaction, transformation, plating, colony picking and entry clones extraction, which have been successfully applied to 90 C. elegans genes, 88 Brucella genes, and 188 human genes. More than 97% of the targeted genes were obtained as entry clones. This pipeline has a modular design and can adopt different operations for a variety of cloning/expression strategies.
CONCLUSION: The DDBP method and improved cloning strategy were satisfactory. The cloning pipeline, combined with our recombinant protein HTP expression pipeline and the crystal screening robots, constitutes a complete platform for structure genomics/proteomics. This platform will increase the success rate of purification and crystallization dramatically and promote the further advancement of structure genomics/proteomics.

Year:  2007        PMID: 17663785      PMCID: PMC1950093          DOI: 10.1186/1472-6750-7-45

Source DB:  PubMed          Journal:  BMC Biotechnol        ISSN: 1472-6750            Impact factor:   2.563


Background

One of the results from genome sequencing projects, such as the human genome project, is to promote the development of structural genomics/proteomics endeavors which focus on the large-scale determination of protein structures and functions. The traditional cloning and expression approach is inadequate for such a daunting task, and high throughput (HTP) methods are clearly necessary [1,2]. An integrated robotic pipeline can streamline the complex experimental procedures and makes it possible to carry out gene cloning and protein expression for a large amount of targets in a timely and reproducible manner. Some groups have developed the HTP cloning method including the design of nested primers for PCR cloning [3], while we have also developed an automated pipeline for recombinant protein expression, applying the GATEWAY cloning/expression technology and a stepwise automation strategy on an integrated robotic platform [4]. The robotic pipeline is fully operational and has produced a large number of soluble recombinant proteins in E. coli using the open reading frame cDNA library (ORFeome) for C. elegans and human genomes [5,6]. However, the success rate of expressing soluble proteins is limited when the full length ORF was used to express the target protein. In a number of cases, including our own results, soluble proteins could be expressed in E. coli when a smaller fragment derived from the ORF was used for expression [7-10]. We have identified smaller protein fragments from spontaneous degradation and limited proteolysis, and recloned them for expression [7,8]. Compared to expressing soluble proteins carrying GATEWAY tags due to cloning artifacts, the soluble expression rate was increased from 1.3% to 27.6% when the GATEWAY tags were not included, and a 41.7% rate of soluble expression was achieved when the identified fragment without both GATEWAY tag encoded sequences was recloned (data not shown). 
The GATEWAY tags referred to here are the amino acid sequences TSLYKKAGX and TQLSCTKW, which result from the recombination sites attB1 and attB2, respectively, generated by the GATEWAY LR reaction [11]. X is an amino acid that depends on the coding sequence. With pET15g as the expression vector, which was engineered from pET15b (Novagen) to be compatible with GATEWAY cloning [4], the final N-terminal tag sequences in the originally and newly cloned genes are MGSSHHHHHHSSGLVPRGSQSTSLYKKAGX and MGSSHHHHHHSSGLVPRGSQSTSLYKKAGLVPRGS, respectively. Here HHHHHH is the his-tag, which is followed by a thrombin cleavage site derived from the pET15b vector (LVPR|GS, named thrombin site I; the cut is between R and G), TSLYKKAG is the N-terminal GATEWAY tag generated by the GATEWAY LR reaction, and the final LVPRGS is the newly introduced thrombin site (named thrombin site II) that is used to remove the N-terminal GATEWAY tag. No C-terminal GATEWAY tag was present in the newly cloned genes because a stop codon was introduced after the coding sequence. Thus, after the his-tag was removed by protease digestion at thrombin site I, the clones that included the GATEWAY tags expressed a recombinant protein carrying Sequence I (GSQSTSLYKKAGX) at the N-terminus and Sequence II (TQLSCTKW) at the C-terminus in addition to the coding sequence. In the clones without the GATEWAY tags, the recombinant protein contained only GS at the N-terminus in addition to the coding sequence. More recently, 23 fragments were recloned and 6 of them yielded diffraction-quality crystals, which led to 3 structures [7,8]. These findings suggested that the sequences derived from the GATEWAY tags affect soluble expression, and that a well-folded fragment/domain of the target protein is best suited for expression of a soluble recombinant protein in E. coli. In fact, 90% of the structures of human proteins deposited in the Protein Data Bank (PDB) [12] comprise a fragment of the gene.
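The effect of the two thrombin sites can be illustrated with a short string manipulation. The sketch below is a simplification rather than a model of real protease specificity: it treats every occurrence of LVPRGS as a cleavage site and cuts between R and G. The tag sequences are those quoted above; the function name `thrombin_digest` and the coding sequence `cds` are hypothetical placeholders introduced only for illustration.

```python
import re

# Tag sequences as quoted in the text (X stands for a sequence-dependent residue).
OLD_N_TAG = "MGSSHHHHHHSSGLVPRGSQSTSLYKKAGX"
NEW_N_TAG = "MGSSHHHHHHSSGLVPRGSQSTSLYKKAGLVPRGS"
C_TAG = "TQLSCTKW"


def thrombin_digest(protein):
    """Split a sequence at every LVPR|GS motif (cut between R and G).

    Assumption: any LVPRGS motif is cut completely; real digests depend
    on accessibility and reaction conditions.
    """
    fragments = []
    start = 0
    for match in re.finditer("LVPRGS", protein):
        cut = match.start() + 4  # position just after the R of LVPR
        fragments.append(protein[start:cut])
        start = cut
    fragments.append(protein[start:])
    return fragments


cds = "MAEEKTPLF"  # hypothetical coding sequence, for illustration only

# Original clones: GATEWAY tags on both ends, only thrombin site I present.
old_product = thrombin_digest(OLD_N_TAG + cds + C_TAG)[-1]
# New clones: thrombin site II after attB1 and a stop codon after the CDS.
new_product = thrombin_digest(NEW_N_TAG + cds)[-1]

print(old_product)  # Sequence I + CDS + Sequence II
print(new_product)  # GS + CDS
```

Under these assumptions the last fragment of the digest is the product that carries the coding sequence: it keeps 13 extra N-terminal and 8 extra C-terminal residues in the original clones, but only GS in the new ones.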
We therefore modified our robotic pipeline to incorporate an automatic operation that selects a proper domain/fragment from the ORF for recombinant protein expression, and we used the new cloning strategy described above. New bioinformatics tools and cloning methods were developed and adapted to the previously established robotic pipeline, as discussed in this report. The major modifications included the automatic design of PCR primers and an improved multi-step laddered PCR, followed by the previously established micro BP reaction of GATEWAY cloning, transformation, plating of transformed E. coli cells (DH5α), colony picking and entry clone plasmid DNA extraction. The automated cloning module is combined with our automated protein expression module, which consists of construction of expression clones in 96-well plates and protein solubility profiling by dynamic ELISA, to form a protein expression platform for structural genomics/proteomics. The cloning module is flexible and can efficiently carry out different cloning strategies, as shown here. A number of algorithms for predicting domain boundaries have been developed previously [13-18]. Most of them, however, are not publicly available or cannot be adapted to our HTP pipeline. We report here a new composite scheme to locate domains with relatively accurate boundaries. The programs included in the scheme are InterPro/InterProScan [19,20], Domain Linker Finder [16], BLAST [21], SignalP [22,23] and TMHMM [24]. The BLAST alignment and the signal peptide and transmembrane (TM) region predictions were combined with the results of InterPro/InterProScan and Domain Linker Finder to define the fragment for cloning. This composite method has been validated with experimental results.

Results and discussion

HTP cloning of 366 ORFs

The GATEWAY system is a suitable method for HTP cloning in 96-well plates. However, when entry clones (generated with pDONR201) and the expression vector pET15g are combined by the LR reaction, the recombination sequence attB1 may add additional unwanted 9 amino acids (TSLYKKAGX) at the N-terminus if the insert is downstream from a fusion peptide, and the attB2 site may add TQLSCTKW at the C-terminus if no stop codon follows the coding sequence. We named sequences from attB1 and attB2 as the GATEWAY tags. The additional amino acids derived from GATEWAY tags may interfere with subsequent experiments, such as soluble expression of the recombinant protein, purification problems due to aggregation of the protein, and crystallization of the protein (see descriptions in Background). It is therefore desirable to engineer a protease (thrombin here) cleavage site (PCS) after attB1 (Figure 1). A stop codon was also added right after the coding sequence in primer design to eliminate the extra amino acids at the C-terminus due to GATEWAY cloning. After the protein is purified, all amino acids prior to PCS, i.e. MGSSHHHHHHSSGLVPRGSQSTSLYKKAGLVPR, can be removed by the protease cleavage. Compared with the clones in which GATEWAY tags were included, newly cloned and expressed recombinant proteins contained only GS at the N-terminus in addition to the coding sequence. And if no new PCS was introduced, expressed proteins would have Sequence I, i.e. GSQSTSLYKKAGX at the N-terminus and Sequence II, i.e. TQLSCTKW at the C-terminus in addition to the coding sequence after the his-tag was removed by protease digestion through the thrombin site I (For details, see Background). Since the PCS was included in the primer synthesis in our strategy and the long forward primer would be costly and could increase the chance of errors, we designed a PCR strategy using two forward primers and two reverse primers (see Methods: Primer design and the PCR protocol for HTP cloning). 
This strategy has two advantages: only short primers are required, and primer F2, R2 could be synthesized in bulk. Such measures significantly reduce the cost and the error rate in 96-well operations.
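The origin of the two GATEWAY tags can be checked by translating the attB sites in frame. The sketch below assumes the commonly published 25-bp attB1 and attB2 sequences (they are not given in this article) and assumes that the reading frame starts at the first base of each site; the helper name `translate` is hypothetical. Only complete codons are translated, so the final base of attB1, which begins the codon for the sequence-dependent residue X, is left out.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Assumed attB sequences (sense strand), taken from general GATEWAY
# documentation rather than from this article.
ATTB1 = "ACAAGTTTGTACAAAAAAGCAGGCT"
ATTB2 = "ACCCAGCTTTCTTGTACAAAGTGGT"


def translate(dna):
    """Translate complete codons from the first base; ignore a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate(ATTB1))
print(translate(ATTB2))
```

With these assumed sequences the translations reproduce the tag residues named in the text, TSLYKKAG for attB1 and TQLSCTKW for attB2.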
Figure 1

The primer design strategy using two pairs of primers. Primer F2 and R2 contained attB sites and no gene specific region, which could be synthesized in bulk; Primer F1 and R1 contained gene specific sequences and an overlap region with Primer F2 and R2. CDS stands for coding sequence and a protease cleavage site was engineered after attB1 site.

A comprehensive computer program has been developed to carry out primer design for selected genes. The length of the gene-specific nucleotides in the entire primer should usually be kept between 20 and 30 bases according to the manufacturer's manual [25] and our previous experience, so the length of gene-specific oligos in this program is set in this range. Since PCR cloning is carried out in 96-well plates, conditions such as denaturation time and cycle number are the same for all wells even though each well represents a different gene. Therefore, in addition to grouping coding regions of similar length in one plate, we also chose to design primers that would have similar melting temperatures (Tm). The best Tm was about 60°C in our experiments, so we tried to bring the Tm of all oligos as close to 60°C as possible by adding or subtracting one base at a time. Besides the length of the oligo, the salt concentration also affects the Tm; in our program, the salt concentration was set at 10 mM. After the gene-specific oligo was designed with an optimal Tm, sequences corresponding to attB1 or attB2, the PCS and a stop codon were added. The primer design program was written in PERL and can easily be modified to accommodate changes in primer sequences. After receiving primers for 90 C. elegans, 88 Brucella, and 188 human ORFs in 96-well plates, HTP cloning (Figure 3), including PCR, E-Gel check, BP reaction, transformation, colony picking, cell culture and mini-prep, was performed on our integrated robotic platform.
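The Tm-matching step can be sketched as a search over gene-specific lengths of 20 to 30 bases. This is only an illustration in Python, not the PERL program described above: the Tm formula used by that program is not stated in this section, so the sketch assumes a widely used salt-adjusted approximation, Tm = 81.5 + 16.6 log10[Na+] + 0.41(%GC) − 675/N, and the function names and example sequence are hypothetical. A different Tm model can be passed in through `tm_func`, and absolute Tm values depend strongly on the model chosen.

```python
import math


def salt_adjusted_tm(oligo, na_molar=0.010):
    """Approximate Tm in deg C (assumed formula, not necessarily the authors')."""
    n = len(oligo)
    gc = oligo.count("G") + oligo.count("C")
    return 81.5 + 16.6 * math.log10(na_molar) + 41.0 * gc / n - 675.0 / n


def pick_gene_specific(cds, target_tm=60.0, min_len=20, max_len=30,
                       tm_func=salt_adjusted_tm):
    """Return the 5' stretch of cds (min_len..max_len bases) with Tm closest to target.

    Scanning every allowed length gives the same endpoint as adding or
    removing one base at a time. Raises ValueError if cds is shorter
    than min_len.
    """
    candidates = [cds[:n] for n in range(min_len, min(max_len, len(cds)) + 1)]
    return min(candidates, key=lambda oligo: abs(tm_func(oligo) - target_tm))


example_cds = "ATGGCGGCCGCTGGCAAGGCCGAGCTGCTGAAGGAC"  # hypothetical sequence
forward_part = pick_gene_specific(example_cds)
print(forward_part, len(forward_part), round(salt_adjusted_tm(forward_part), 1))
```

The same function could be applied to the reverse complement of the 3' end of the coding region to obtain the gene-specific part of the reverse primer.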
From 366 attempted amplifications, 337 PCR products could be detected by E-Gel (Figure 4). Interestingly, 20 of the 29 targets whose PCR products could not be detected by E-Gel could still be transformed and were obtained as entry clones. This phenomenon has also been observed by other research groups [26]. Including the clones derived from PCR products that were not detectable by E-Gel but were transformed successfully, our PCR protocol showed a success rate of 97.5%. Our follow-up results suggested that PCR determines the final success rate of the whole HTP cloning process (Table 1), whereas other steps, such as the BP reaction and transformation, have negligible effects on the final outcome. In the end, 96.7% of the ORFs were obtained as entry clones, which were verified by PCR/E-Gel check.
Figure 3

A schematic representation of HTP cloning and expression pipeline with the aid of bioinformatics tools. In above HTP cloning pipeline, some steps, which were marked with star, were not performed on BiomekFX robot. ExtractCDS and BatchPrimer were two PERL programs used for extraction of the DNA coding sequence from a full-length sequence (ORF) and design of gene specific primers.

Figure 4

An E-Gel test result for entry clones of the second plate of 94 human genes. 2% E-Gel® 96 Agarose with E-Gel® Low Range Quantitative DNA Ladder were used in the test.

Table 1

Statistic of PCR and entry clone success rates of HTP cloning

Source | All ORFs | PCR (success rate) | Entry clone (success rate)
C. elegans | 90 | 85 (94.4%) | 83 (92.2%)
Human | 188 | 184 (97.9%) | 183 (97.3%)
Brucella | 88 | 88 (100%) | 88 (100%)
Total | 366 | 357 (97.5%) | 354 (96.7%)
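The percentages in Table 1 follow directly from the counts. The short check below recomputes them; the variable and function names are illustrative only.

```python
# Counts from Table 1: (ORFs attempted, PCR successes, entry clones obtained).
counts = {
    "C. elegans": (90, 85, 83),
    "Human": (188, 184, 183),
    "Brucella": (88, 88, 88),
}


def rate(successes, attempted):
    """Success rate in percent, rounded to one decimal place."""
    return round(100.0 * successes / attempted, 1)


# Column-wise sums give the "Total" row.
totals = tuple(sum(column) for column in zip(*counts.values()))

for source, (attempted, pcr_ok, clones_ok) in counts.items():
    print(source, rate(pcr_ok, attempted), rate(clones_ok, attempted))
print("Total", totals, rate(totals[1], totals[0]), rate(totals[2], totals[0]))
```

The 357 PCR successes in the total row are the 337 products detected by E-Gel plus the 20 undetected products that still gave entry clones.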

Validation of domain identification

Proteins are usually composed of multiple domains connected by linkers. Removal of flexible tails or separation of fragments would yield more compact and stable protein fragments that are more suitable for expression of a soluble recombinant protein and subsequent studies including crystallization, as demonstrated by data presented below. We aimed at developing an integrated strategy, named DDBP (domain/domain boundary prediction), to predict domain boundaries and stable fragments within the full length protein coded by the ORF. In this strategy, InterPro/InterProScan, PDB homology alignment, and Domain Linker Finder were the core methods used for domain prediction. In addition, signal peptide prediction by SignalP and TM regions prediction by TMHMM provided supplementary information for more accurate prediction. InterPro is an integrated database that consists of most of the essential databases for domain and function site available today, such as PFAM [27], ProDom [28], SMART [29], PRINTS [30], PROSITE [31], TIGRFAM [32], SUPERFAMILY [33], etc. InterProScan, which is used together with InterPro database, is a tool that combines different protein signature recognition methods into one resource. Since InterPro contains many different domain and function site databases, conflicted results often appear when different databases were used. Moreover, InterPro/InterProScan analysis could only predict the core region of a domain, but not the domain boundaries. To improve the prediction accuracy, Domain Linker Finder (DLF), which applies the neural network method to distinguish domain linker sequences from non-linker sequences, was used to confirm domain prediction results obtained by InterPro/InterProScan, and to define more accurately the domain boundaries. As the first step of DDBP, prediction of the signal peptide and the TM region for each ORF was carried out by SignalP and TMHMM, respectively. 
The identified signal peptide was eliminated as an unstable region, and TM regions would be treated as domain linkers that were later integrated into the results from DLF. The second step is to perform BLAST analysis against the PDB database to find potential domain relevant information. Finally InterPro/InterProScan and DLF programs were executed. When results of InterPro/InterProScan and DLF were available, further analyses were performed: (1) if results of InterPro/InterProScan can be confirmed by PDB alignment results, manually integrate them and decide common domain boundary positions. For example, protein 3-H6, i.e. NP_508026 (Figure 5A), which comprises 431 amino acids, has no signal peptide and TM regions according to the prediction of SignalP and TMHMM. The result of InterPro/InterProScan showed this protein contains three possible domains/fragments: Domain1 (24–118), Domain2 (141–234) and Domain3 (254–370). While PDB alignment results showed: (a) the region 4–131 of 3-H6 is homologous to the region 20–147 of a 149-Amino-acid protein (PDB ID: 1ROU, containing 1 domain) with 60% identity; (b) the region 4–244 of 3-H6 is similar to the region 41–280 of a 280-amino-acid protein (PDB ID: 1Q1C, containing 2 domains) with 48% identity; (c) the region 7–408 of 3-H6 is similar to the region 24–422 of a 457-amino-acid protein (PDB ID: 1KTO/A, containing 3 domains) with 40% identity; (d) the region 128–428 of 3-H6 is similar to the region 22–330 of a 336-amino-acid protein (PDB ID: 1P5Q/A, containing 2 domains) with 35% identity. The results from InterPro/InterProScan prediction appear to be consistent with the results of PDB alignments. By combining these two results, three protein fragments were selected for 3-H6: 1–131, 128–244, and 245–431 as the stable region. 
(2) if results of InterPro/InterProScan and PDB alignments were not consistent, but one of two results could be confirmed by DLF, the consistent results were manually combined and domain boundary positions were assigned. TM regions were integrated with the result from DLF at this stage as well. For example, 11020-H6, i.e. the region 299–792 of NP_493412 (Figure 5B), a 494-amino-acid protein without TM regions and a signal peptide, was predicted to have three possible domains/fragments by InterPro/InterProScan (Fragment1: 53–225; Fragment2: 236–494; Fragment3: 337–475) and no homologous protein structures were found by PDB alignment. DLF results showed that protein 11020-H6 may contain five possible domain linkers (DL1: 19–52; DL2: 106–145; DL3: 215–241; DL4: 325–330; DL5, 373–383), in which DL1 and DL3 were consistent with Fragment1, the N-terminal end of Fragment2; and DL4 was consistent with the N-terminal end of Fragment3. DL2 was ignored. Since Fragment3 was contained within Fragment2, it is possible that Fragment2 might contain at least two domains, and Fragment3 might be one of them. The final predicted stable domains/fragments of 11020-H6 were: 53–225, 236–494 and 331–494; (3) if results of InterPro/InterProScan and PDB alignments were not consistent, and no result from DLF was available or the DLF prediction didn't support any results from InterPro/InterProScan or PDB alignments, the N-terminus and C-terminus of the ORF would be treated as domain boundaries. After completing the prediction, a final check was performed to ensure that the region between two predicted domain boundaries should be at least 80 amino acids. If a predicted domain contained less than 80 amino acids, one of the two domain boundaries with a less reliability would be omitted and the domain was joined to the next domain/fragment, except that positive PDB alignment results were available and supported that the short predicted domain was long enough to form a stable domain.
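The final length check can be sketched as follows. This is a deliberately simplified illustration, not the procedure used in this work: it assumes contiguous, non-overlapping fragments, always joins a short fragment to the following one (or to the preceding one if it is the last), and ignores both the relative reliability of the two boundaries and the exception for short domains supported by PDB alignments. The function name and the example coordinates are hypothetical.

```python
def enforce_min_length(fragments, min_len=80):
    """Merge fragments shorter than min_len into a neighbouring fragment.

    fragments: sorted, non-overlapping (start, end) pairs, 1-based inclusive.
    Simplification: a short fragment is joined to the next one; a short
    trailing fragment is joined to the previous one.
    """
    merged = []
    pending_start = None  # start of a short fragment waiting to be joined
    for start, end in fragments:
        if pending_start is not None:
            start = pending_start
            pending_start = None
        if end - start + 1 < min_len:
            pending_start = start
        else:
            merged.append((start, end))
    if pending_start is not None:
        last_end = fragments[-1][1]
        if merged:
            prev_start, _ = merged.pop()
            merged.append((prev_start, last_end))
        else:
            merged.append((pending_start, last_end))
    return merged


# Hypothetical boundaries: the 50-aa and 30-aa pieces are too short.
print(enforce_min_length([(1, 50), (51, 200), (201, 230)]))
# Hypothetical boundaries where every piece already has at least 80 aa.
print(enforce_min_length([(1, 120), (121, 260), (261, 400)]))
```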
Figure 5

Two examples for interpreting DDBP (domain/domain boundary prediction) method. A: According to the prediction of Interpro/InterProScan, 3-H6 (NP_508026), a 431-amino-acid protein that has no TM region or the signal peptide, possibly contained three domains: Domain1 (24–118), Domain 2 (141–234), and Domain 3 (254–370). a, b, c, d on the right of horizontal lines mark four separate alignment results between protein 3-H6 and Protein Data Bank (PDB) database. a: the region 4–131 of 3-H6 is homology with the region 20–147 of 1ROU with 60% identity; b: the region 4–244 of 3-H6 was similar to the region 41–280 of 1Q1C with 48% identity; c: the region 7–408 of 3-H6 was similar to the region 24-422 of a 1KTO/A with 40% identity; d: the region 128–428 of 3-H6 was similar to the region 22–330 of 1P5Q/A with 35% identity. By combining the results of Interpro/InterProScan and alignments, three protein fragments (1–131, 128–244, and 245–431) were selected for 3-H6 as stable domains/fragments. B: 11020-H6 (corresponding to the region 299–792 of protein NP_493412), a 494-amino-acid protein that has no TM regions or the signal peptide, was predicted to have three possible domains/fragments (Fragment1: 53–225; Fragment2: 236–494; Fragment3: 337–475) by InterPro/InterProScan (shown on top). DLF results showed that protein 11020-H6 may contain five possible domain linkers (DL1: 19–52; DL2: 106–145; DL3: 215–241; DL4: 325–330; DL5, 373–383) (shown at the bottom). The stable domains/fragments of 11020-H6 were predicted as 53–225, 236–494 and 331–494 by the DDBP method (shown as the conclusion in the box at right).

In order to validate this combination scheme, we constructed a dataset that contains the definition of 47 domains/fragments from our experimental results (see Method: Datasets for domain/domain boundaries prediction) and made a comparison between the experimental and DDBP prediction results (Table 2).
In the comparison, the experimentally determined domain boundaries are assumed to be correct. For each protein, regardless of how many domains it contained or were predicted, the prediction was scored as accurate if both boundaries of at least one predicted domain were the same as those of one correct domain or deviated from them by no more than 10 aa. Similarly, if the deviation was greater than 10 aa but no more than 30 aa, the prediction was scored as basically accurate, and if the deviation was greater than 30 aa, the prediction was scored as wrong. For example, protein 11011-D8 (Table 2) has one experimentally determined domain: 45–190. The DDBP method predicted two possible domains: 1–107 or 52–190. Because one of the predicted domains (52–190) was consistent with the correct result, i.e. both deviations (52 − 45 = 7 and 190 − 190 = 0) were ≤ 10 aa, this was an accurate prediction. Protein 4-F5 (Table 2) has one experimentally determined domain (1–144), and its domains predicted by the DDBP method were 1–124 and 143–269. Here the prediction was scored as basically accurate because the larger of its deviations (1 − 1 = 0 and 144 − 124 = 20) was > 10 aa and ≤ 30 aa. The complete comparison for all 47 domains is listed in Table 2. It shows that more than 60% of the predictions were consistent with the experimental results: 43% were accurate (labeled I in column A) and 19% were basically accurate (labeled II in column A).
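The scoring rule can be written down compactly. The sketch below reflects one reading of the rule, namely that a prediction is scored by the best-matching pair of predicted and experimental domains and that the deviation of a pair is the larger of its two boundary differences; the function name `classify` is hypothetical. The three examples use coordinates from Table 2.

```python
def classify(experimental, predicted, accurate=10, basic=30):
    """Score a prediction as 'I' (accurate), 'II' (basically accurate) or 'III' (wrong).

    experimental, predicted: lists of (start, end) pairs.
    Assumption: the best-matching pair decides, and both boundaries of
    that pair must lie within the tolerance.
    """
    best = min(
        max(abs(p_start - e_start), abs(p_end - e_end))
        for e_start, e_end in experimental
        for p_start, p_end in predicted
    )
    if best <= accurate:
        return "I"
    if best <= basic:
        return "II"
    return "III"


print(classify([(45, 190)], [(1, 107), (52, 190)]))   # 11011-D8
print(classify([(1, 144)], [(1, 124), (143, 269)]))   # 4-F5
print(classify([(35, 250)], [(1, 227)]))              # 9-G11
```

Under this reading, 11011-D8 scores I (largest deviation 7 aa), 4-F5 scores II (20 aa) and 9-G11 scores III (34 aa), in agreement with the levels listed in Table 2.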
Table 2

Comparisons between experimental and DDBP prediction results*

A | B | C | D | E | F | G | H | I | J | K
I 11058-C7 | 249 | no | no | 4–190; 1–220; 80–190 | no | 24%, (7–240/8–268, 288); 26%, (4–240/7–225, 251) | (1–249)# | 1–249 | NP_506406 | F20G2.1
I 11048-D3 | 199 | no | no | 9–199 | 76–98, 181–181 | 26%, (5–160/3–164, 208) | (1–199)$ | 1–199 | NP_502315 | F35G2.2
I 11011-D8 | 190 | no | no | 8–36, 47–75, 83–111; 2–107 | 148–172, 108–120, 82–92, 49–51 | 27%, (37–105/1–69, 146) | (45–190)& | 1–107, 52–190 | NP_493641 | F23F1.2
I 18-A2 | 210 | no | no | 29–79; 108–194 | no | 30%, (75–193/4–115, 135) | (74–210)& | 75–210 | NP_491893 | BAG1 (human) homolog family member (bag-1)
I 11033-F3 | 208 | no | no | 6–74; 128–194; 1–97; 80–207 | 39–56 | 31%, (2–207/1–197, 198) | (1–208)# | 1–208 | NP_496863 | Glutathione S-Transferase family member (gst-16)
I 11-D11 | 346 | no | no | 80–317; 55–320; 219–317 | 19–34, 116–126 | 31%, (76–334/36–291, 298) | (56–346)& | 55–346 | NP_491872 | C55B7.3
I 11104-F4 | 370 | no | no | 2–221, 1–370 | 126–143, 348–352, 19–25, 71–82, 292–297, 233–237 | 34%, (128–347/15–231, 265) | (125–370, 1–124)& | 1–125, 128–347 | NP_001040820 | Cell Division Cycle related family member (cdc-37)
I 79-D4 | 401 | no | no | 65–395; 37–400 | 66–108, 27–45, 108–125, 217–236, 138–146 | 35%, (212–395/2–185, 185) | (206–401)# | 212–401 | NP_491735 | C06A5.7b
I 9-H3 | 212 | no | no | no | 19–52, 162–192, 79–89 | 35%, (86–136/84–131, 217) | (59–212)$ | 53–212 | NP_493365 | Y40B1B.5
I 76-D4 | 254 | no | no | 2–171; 3–250; 139–167 | 139–147 | 36%, (3–251/8–265, 278) | (1–254)# | 1–254 | NP_001021765 | Y47G6A.22
I 8-C1 | 142 | no | no | 4–140 | no | 46%, (5–141/9–149, 150) | (1–142)& | 1–142 | NP_499813 | T12D8.6
I 11-F6 | 327 | no | 209–231, 246–268 | 207–227, 246–266 | 56–95, 19–56, 95–136, 136–161 | 50%, (141–167/1–28, 163) | (1–182, 1–145)& | 1–135 | NP_491774 | T09B4.5a
I 1-F11 | 229 | no | no | 148–217; 170–201; 170–212 | 19–21, 129–134 | 59%, (135–220/20–107, 113) | (135–229)# | 135–229 | NP_506367 | F53F4.3
I 3-H6 | 431 | no | no | 24–118, 141–234; 254–370; 261–370 | 397–413, 214–225, 100–107, 123–128 | 60%, (4–131/20–147, 149); 48%, (4–244/41–280, 280); 40%, (7–408/24–422, 457); 35%, (128–428/22–330, 336) | (1–135)# | 1–131, 128–244, 245–431 | NP_508026 | FK506-Binding protein family member (fkb-6)
I 20-H6 | 496 | no | no | 38–496; 186–261, 293–363, 422–483; 183–272, 275–374, 422–487 | 120–151, 265–279 | 66%, (394–496/1–103, 104); 75%, (183–269/1–87, 90); 75%, (290–366/1–78, 85) | (169–385, 386–496)& | 183–272, 290–366, 394–496 | NP_001022967 | U2AF splicing factor family member (uaf-1)
I 1-D10 | 206 | 1–21 | no | 41–198; 23–77, 85–141, 143–196 | no | no | (23–206)# | 22–206 | NP_491320 | R12E2.13
I 11020-H6** | 494 | no | no | 53–225; 337–475; 336–453; 236–494 | 19–52, 106–145, 215–241, 373–383, 325–330 | no | (1–237, 238–494)&$ | 53–225, 236–494, 331–494 | NP_493412 | Y37H9A.3
I 70-H8 | 130 | no | 107–129 | 109–129 | 19–104 | no | (1–130)# | 1–130 | NP_491052 | W03D8.3
I 8-C9 | 183 | no | no | 125–159 | no | no | (1–183)#; (28–183, 23–183)$ | 1–183 | NP_510277 | BMP receptor Associated protein family member (bra-1)
I 11005-B8 | 245 | no | no | no | 19–20, 129–136, 41–46 | no | (9–245)# | 1–245 | NP_740981 | R05F9.1b
II 18-F7 | 288 | no | no | 34–286 | 19–30, 72–78, 196–202, 66–70 | 45%, (35–286/21–274, 276) | (32–266)$ | 34–286 | NP_001021584 | EXOnuclease family member (exo-3)
II 4-F5 | 592 | no | no | 189–269 | 485–509, 125–142, 19–37, 369–382, 99–109, 311–317 | 35%, (210–269/15–74, 76) | (1–144)& | 1–124, 143–269 | NP_494544 | C16C8.16
II 11011-C6 | 162 | no | no | 4–162; 5–54, 71–132 | 51–68 | no | (1–148, 1–124)& | 1–162 | NP_500324 | F42A6.6
II 11058-H2 | 249 | no | no | 4–190; 4–230; 4–209 | no | 40%, (5–245/23–263, 267) | (1–222)$ | 1–249 | NP_506407 | F20G2.2
II 76-F10 | 263 | no | no | 23–130; 33–109 | 118–126, 20–20 | 27%, (27–120/6–98, 108) | (1–129)& | 21–130 | T26031 | hypothetical protein W01A8.2
II 79-H11 | 245 | no | no | 2–156, 1–241 | 144–181, 181–204, 121–144 | 51%, (2–154/2–154, 155) | (1–185)$ | 1–156 | NP_492567 | C03D6.5
II 25-B11 | 302 | no | no | 23–127; 45–114 | 203–236, 19–24, 236–248, 146–161, 187–195 | no | (1–153)& | 23–145 | NP_492781 | B0511.7
II 11-D3 | 313 | no | no | 28–214; 9–302 | 151–174, 19–22 | no | (1–313)#; (213–313)$ | 23–313 | NP_001021333 | Suppressor of PResenilin defect family member (spr-2)
II 11058-F12 | 272 | no | no | 40–63; 40–68, 90–124 | no | 32%, (67–119/3–54, 60); 26%, (40–119/5–84, 87); 31%, (41–114/36–113, 124) | (1–147)& | 1–124 | NP_503566 | F36F12.8
III 20-D7 | 500 | no | no | 311–395; 41–278; 283–436 | 387–417, 288–308, 453–469, 481–482 | 28%, (342–432/68–153, 289) | (1–500)#; (298–500, 388–500, 407–500)$ | 1–287, 283–452 | NP_491868 | lariat DeBRanching enzyme related family member (dbr-1)
III 37-G9 | 245 | no | no | no | 206–227, 19–21 | no | (1–102, 103–245)& | 1–245 | NP_507040 | F14H3.6
III 70-D2 | 265 | 1–25 | 15–37 | 69–243 | 223–247, 195–199 | 22%, (128–264/13–121, 135) | (1–130, 1–174)& | 1–265 | AAC25860 | Hypothetical protein C37C3.3
III 2-B6 | 316 | no | no | 27–294 | 225–260, 174–191, 149–163, 296–298, 71–82, 84–84, 219–220 | 23%, (53–168/46–180, 201) | (71–294)#; (104–316)$& | 1–295 | NP_501422 | D2096.8
III 10-E5 | 274 | no | no | 1–80, 108–190 | 229–246, 193–213 | 25%, (17–161/32–166, 196) | (65–237, 1–74, 75–274)& | 1–192 | NP_502380 | C25G4.6
III 3-D2 | 419 | no | no | 95–128, 133–166; 93–197; 133–166 | 196–217, 41–62, 259–275, 341–351, 19–21 | 26%, (95–239/13–144, 166); 32%, (99–197/8–95, 118) | (140–309, 290–419)& | 93–197, 93–258 | NP_495087 | C17G10.2
III 113-H8 | 588 | no | no | 232–342; 241–334 | 492–515, 19–56, 81–132, 150–185 | 29%, (234–328/2–85, 105) | (1–345, 34–313)& | 232–342 | NP_740981 | R05F9.1b
III 76-F6 | 803 | no | 769–791 | 17–250, 558–652; 10–34, 111–148, 382–402 | 752–767, 19–24, 717–727, 280–281 | 29%, (62–257/5–200, 205); 29%, (83–257/1–175, 181) | (1–181)& | 25–257 | NP_491008 | alpha-CaTuliN (catenin/vinculin related) family member (ctn-1)
III 4-A4 | 569 | no | no | 164–338, 369–527; 187–200, 206–222, 263–279, 305–321, 321–335, 506–527; 263–318, 388–452, 511–568 | 65–97, 19–38, 112–139 | 32%, (154–567/18–397, 402) | (334–542, 334–501)&; (1–569)# | 154–549 | NP_495753 | associated with RAN (nuclear import/export) function family member (ran-3)
III 25-H8 | 339 | no | no | 24–84; 24–75 | 215–276, 111–169, 187–201, 100–109 | 41%, (22–75/7–61, 70) | (1–152)& | 1–99 | NP_495652 | T09A5.8
III 2-H9 | 356 | no | no | 22–273; 1–108, 115–297; 13–284 | 192–215, 303–312, 338 | 42%, (1–284/1–289, 382) | (1–335, 1–356)$& | 1–197 | NP_497949 | T23F11.1
III 18-H1 | 208 | no | no | 32–124; 36–113; 24–45, 51–68, 131–145, 164–181, 186–205 | 171–190, 114–171, 19–40 | 43%, (41–110/10–76, 90) | (1–81)&$ | 32–124 | NP_510410 | HIStone family member (his-24)
III 11049-D6 | 435 | no | no | 293–433; 270–433 | 110–152, 234–260, 175–186, 375–381 | no | (1–156)#; (9–158, 1–119)$& | 261–435 | NP_001041025 | Y41E3.7a
III 9-G11 | 250 | no | no | 1–194 | 228–232 | no | (35–250)& | 1–227 | NP_497076 | R05H10.1
III 10-E1 | 251 | no | no | no | 158–199 | no | (1–206)& | 1–251 | NP_496943 | W01G7.4
III 37-F11 | 230 | no | no | 1–230 | 106–126, 168–182 | no | (45–179, 45–230)& | 1–230 | NP_507024 | T10C6.5
III 75-A8 | 228 | no | no | no | 19–44, 125–135 | no | (1–189)$& | 1–228 | NP_492509 | F46A9.1
III 11048-E2 | 262 | no | no | 33–178; 61–179 | 183–198 | no | (1–262)$ | 1–182 | NP_501337 | MEChanosensory abnormality family member (mec-17)

* Column A: prediction accuracy level (I, accurate; II, basically accurate; III, wrong) and Plate-ID; users can query sequence-related information on the SGCE web site [41]. Column B: number of amino acids in the ORF. Column C: signal peptide prediction by SignalP. Column D: transmembrane region prediction by TMHMM. Column E: domains/fragments from InterPro/InterProScan analysis. Column F: domain linker regions predicted by Domain Linker Finder. Column G: PDB alignment results, including the percentage of sequence identity, the query/subject start and end positions, and the length of the subject sequence. Column H: experimental results from protein crystal/three-dimensional structures (labeled with #), limited proteolysis (labeled with $) or spontaneous degradation (labeled with &). Column I: DDBP prediction results. Column J: accession number; users can obtain sequence-related information from the National Center for Biotechnology Information (NCBI) [40]. Column K: definition of the protein, or ACEID for proteins without known functions.

** 11020-H6 corresponds to region 299–792 of NP_493412.

Comparisons between experimental and DDBP prediction results*

Application of the DDBP method and the improved cloning strategy

We applied the DDBP method and the improved cloning strategy to test whether cloning the predicted fragments would substantially improve the success rate of obtaining purified, soluble recombinant proteins in E. coli. The test dataset comprised 57 proteins from C. elegans ORFeome version 3.1 for which expression/purification data for the full-length ORFs, obtained with the same expression vector, were available from previous experiments. For these 57 proteins, the coding regions corresponding to the DDBP-predicted fragments were put through the HTP cloning and expression/purification pipeline; 14 of the constructs were shortened. Previously, all full-length proteins in this dataset, carrying the GATEWAY tags at the N-terminus and the C-terminus, had been scored as soluble by 96-well expression profiling in E. coli. However, only two of them could be purified from the E. coli lysates. Most of the recombinant proteins in this dataset were either unstable or formed large aggregates, as shown by gel filtration chromatography. In contrast, after employing the DDBP method and the improved cloning strategy that avoids GATEWAY-encoded sequences, 50 proteins were expressed as soluble and seven were insoluble (Table 3, Figure 6). To date, at least 20 have been purified (Table 3, Figure 7), four of which have been crystallized (data not shown). This is a 10-fold increase in the number of purified proteins from this dataset (from 2 to at least 20), which shows that the combination of the DDBP method and our cloning strategy clearly improves protein expression and purification. However, we do not know whether the improvement derives mainly from correct domain prediction, because most proteins in our test set have only a shortened or only a full-length construct, so a complete comparison cannot be made.
Table 3

Constructs, soluble expression and purification results of 57 proteins used for testing DDBP method

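Well row | Well column | Accession number | Start position | End position | Length (aa) | Protein definition | Soluble expression level (18°C) | Soluble expression level (37°C) | Purified
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
A | 2 | NP_493355 | 1 | 300 | 300 | C01A2.5 | medium | high | yes
A | 3 | NP_001022737 | 1 | 264 | 264 | X-box Binding Protein homolog family member (xbp-1) | medium | high | no
A | 7 | NP_498947 | 1 | 282 | 282 | PeRoXisome assembly factor family member (prx-19) | high | high | yes
A | 9 | NP_497226 | 1 | 253 | 253 | W06E11.4 | medium | medium | no
A | 10 | T26925 | 1 | 195 | 195 | hypothetical protein Y45F10C.5 | medium | not soluble | no
A | 11 | NP_495146 | 1 | 218 | 218 | K05F1.9 | medium | high | no
B | 1 | NP_495062 | 1 | 210 | 210 | Helix Loop Helix family member (hlh-26) | high | high | no
B | 2 | NP_496422 | 1 | 225 | 225 | B0491.3 | high | low | yes
B | 3 | NP_495475 | 1 | 197 | 240 | F10E7.2 | not soluble | low | no
B | 4 | NP_496547 | 21 | 284 | 284 | W03C9.1 | medium | high | no
B | 5 | NP_496156 | 29 | 184 | 184 | R53.8 | low | high | no
B | 8 | NP_501161 | 1 | 250 | 250 | F42C5.3 | medium | high | yes
B | 9 | NP_502163 | 1 | 319 | 319 | C10C6.3 | high | high | yes
B | 10 | NP_500772 | 55 | 348 | 368 | ZK354.6 | not soluble | not soluble | no
B | 11 | NP_501789 | 1 | 297 | 297 | F25H8.1 | high | high | yes
C | 1 | NP_501895 | 1 | 294 | 294 | R09E10.1 | medium | high | yes
C | 2 | NP_500890 | 1 | 243 | 243 | H32C10.2 | high | high | no
C | 3 | NP_501981 | 1 | 388 | 388 | R102.5a | low | low | no
C | 4 | NP_507039 | 1 | 196 | 196 | F14H3.5 | high | low | no
C | 5 | NP_506094 | 1 | 183 | 183 | F23H12.3 | medium | low | no
C | 6 | NP_506094 | 1 | 90 | 183 | F23H12.3 | not soluble | medium | yes
C | 8 | NP_505964 | 20 | 260 | 260 | T04F3.2 | medium | medium | no
C | 9 | NP_506495 | 1 | 252 | 252 | D1086.4 | high | high | no
C | 10 | NP_741113 | 1 | 394 | 419 | C32A3.3a | not soluble | not soluble | no
C | 11 | NP_501199 | 1 | 299 | 299 | F55G1.9 | not soluble | low | no
D | 1 | NP_495021 | 1 | 197 | 197 | EEED8.12 | high | high | no
D | 6 | NP_502315 | 1 | 199 | 199 | F35G2.2 | high | high | yes
D | 10 | NP_491869 | 1 | 232 | 232 | MeDiaTor family member (mdt-18) | medium | low | no
E | 1 | NP_501936 | 1 | 190 | 190 | F01D4.5b | medium | low | no
E | 2 | NP_506929 | 26 | 144 | 206 | F57A10.4 | not soluble | low | no
E | 3 | NP_506929 | 26 | 206 | 206 | F57A10.4 | not soluble | not soluble | no
E | 5 | NP_491210 | 1 | 249 | 249 | T12F5.1 | high | high | no
E | 7 | NP_506245 | 35 | 240 | 240 | R186.3 | low | not soluble | no
E | 8 | NP_495941 | 1 | 269 | 308 | T24H10.1 | high | medium | yes
E | 10 | NP_510298 | 1 | 269 | 269 | AMP-Activated Kinase Beta subunit family member (aakb-1) | medium | low | no
F | 1 | NP_492285 | 1 | 239 | 239 | F02E9.5 | medium | low | no
F | 2 | NP_509787 | 1 | 195 | 195 | F13E6.1 | medium | high | no
F | 3 | AAZ82857 | 1 | 230 | 230 | Hypothetical protein C17H12.13 | not soluble | not soluble | no
F | 4 | NP_493382 | 1 | 210 | 210 | Y87G2A.10 | high | medium | yes
F | 5 | NP_497990 | 1 | 214 | 214 | C38D4.9 | high | high | yes
F | 10 | NP_498391 | 1 | 217 | 217 | C56G2.15 | high | high | yes
G | 2 | NP_493230 | 1 | 183 | 183 | W02A11.2 | high | not soluble | yes
G | 3 | NP_492005 | 1 | 189 | 189 | F22D6.2 | high | high | yes
G | 4 | NP_492692 | 1 | 206 | 206 | Y106G6E.4 | high | medium | yes
G | 5 | NP_492795 | 1 | 207 | 207 | C34B2.5 | high | medium | yes
G | 6 | NP_491736 | 1 | 214 | 214 | C06A5.2 | medium | high | no
G | 7 | NP_491358 | 1 | 233 | 233 | ZK973.9 | low | medium | no
G | 8 | NP_492301 | 1 | 240 | 240 | D1081.9 | not soluble | not soluble | no
G | 9 | NP_492301 | 1 | 65 | 240 | D1081.9 | high | high | yes
G | 11 | NP_491721 | 1 | 273 | 273 | B0207.11 | high | high | no
H | 2 | NP_491965 | 1 | 274 | 274 | T21G5.4 | high | low | yes
H | 3 | NP_491348 | 1 | 287 | 287 | Y47D9A.2a | not soluble | not soluble | no
H | 4 | NP_491903 | 27 | 330 | 363 | D2092.4 | high | medium | no
H | 5 | NP_496803 | 1 | 183 | 183 | F15D4.2 | medium | low | no
H | 6 | NP_491434 | 1 | 177 | 177 | C10H11.7 | low | low | no
H | 8 | NP_495527 | 30 | 179 | 179 | F45E12.5b | high | high | yes
H | 9 | NP_494315 | 1 | 276 | 276 | F22E5.8 | not soluble | not soluble | no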
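
The distinction between shortened and full-length constructs in Table 3 can be checked mechanically. The following is a minimal sketch, not part of the original pipeline: the function names are hypothetical, and the five rows are values as read from Table 3, assuming the "Length (aa)" column gives the full ORF length.

```python
from collections import defaultdict

# (well, accession, start, end, ORF length, purified) as read from Table 3.
ROWS = [
    ("C5", "NP_506094", 1, 183, 183, False),
    ("C6", "NP_506094", 1, 90, 183, True),
    ("G8", "NP_492301", 1, 240, 240, False),
    ("G9", "NP_492301", 1, 65, 240, True),
    ("B4", "NP_496547", 21, 284, 284, False),
]


def is_shortened(start, end, length):
    """A construct is shortened if it does not span the whole ORF."""
    return start > 1 or end < length


def proteins_with_both_constructs(rows):
    """Return accessions that have a full-length and a shortened construct."""
    kinds = defaultdict(set)
    for _well, accession, start, end, length, _purified in rows:
        kinds[accession].add(is_shortened(start, end, length))
    return {acc for acc, seen in kinds.items() if seen == {True, False}}


if __name__ == "__main__":
    print(sorted(proteins_with_both_constructs(ROWS)))
```

Applied to the full table, the same test would single out the proteins for which a direct full-length versus shortened comparison is possible.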
Figure 6

Soluble expression results of 57 proteins used for testing DDBP method. ELISA results for soluble expression at 18°C and 37°C. Different shades in the panels stand for different expression levels: dark gray for the high level, gray for the medium level, white for the low level and black for proteins that were not expressed, as decided by comparison with the positive controls (A12 and B12, each containing one soluble protein). If the ELISA reading of OD (optical density) at 405 nm was higher than or equal to the lower value of the positive controls, the protein in that well was considered expressed. Wells C12 and D12 are negative controls, and blank wells (white with no numbers) are empty. After comparing the results at 18°C and 37°C, seven proteins (wells B10, C10, E3, F3, G8, H3, and H9) were considered not soluble.

Figure 7

Purification results of 57 proteins used for testing DDBP method. Purification results for 15 of the 20 purified proteins. The name of each SDS-PAGE gel has two parts; for example, in B2 (NP_496422), B2 corresponds to the well shown in Figure 6 and Table 3, and NP_496422 is the accession number of the protein in the public database [40]. The bands labeled ''Cut'' in the figure show the results after cleavage by thrombin, and those labeled ''Uncut'' show the results before cleavage. ''Aa'' in the figure stands for the amino acid range of the purified proteins.

NP_506094 and NP_492301 are the only two proteins in the test dataset with both shortened and full-length constructs. Notably, the shortened constructs of these two proteins were successfully expressed and purified, whereas their full-length constructs were either not soluble or could not be purified. Although this result has no statistical significance for the DDBP method, it at least indicates that DDBP is effective for some kinds of proteins in finding a proper domain/fragment within the ORF for recombinant protein expression.

Conclusion

In this paper we presented an effective HTP cloning pipeline and a domain/domain boundary prediction (DDBP) strategy. With this pipeline, four 96-well plates of genes could be cloned into an expression vector in seven days. After integrating the domain/domain boundary prediction strategy, the success rate of purification and crystallization was shown to increase dramatically. Moreover, this cloning pipeline, combined with our recombinant protein HTP expression pipeline and the crystal screening platform, constitutes a complete platform for structural genomics/proteomics. In the next stage, we will improve the accuracy of the bioinformatics analysis of domains and domain boundaries and automate all bioinformatics procedures.

Methods

Genes for HTP cloning

A total of 90 genes from C. elegans ORFeome version 3.1 [5], 188 human genes from Human ORFeome version 1.1 [6], and 88 genes from Brucella melitensis ORFeome version 1.1 [26] were used for evaluating the automated cloning modules. The cDNAs were provided by Dr. Vidal's group at Harvard Medical School as entry clones.

Datasets for domain/domain boundaries prediction

Domain definition for 47 proteins was derived from experimental results and the dataset was used for validating the domain/domain boundary prediction scheme. Among them, some domains were defined by protein crystals/three-dimensional structures; some were defined by limited proteolysis or spontaneous degradation (Table 2). The stable fragment from degraded samples was sequenced from the N-terminus and its molecular weight was determined by mass spectrometry. The domain definition was derived from the gene by starting at the N-terminus as sequenced and adding more amino acids in the gene sequence till the molecular weight matched that determined by mass spectrometry. This dataset was used to calibrate the domain/fragment prediction algorithm. Another dataset that has no relevant experimental information for domain definition was also used to examine this prediction method. This dataset included 57 proteins from C. elegans ORFeome version 3.1. Full-length sequences in this dataset have been inserted into expression vectors previously for expressing recombinant proteins in E. coli with the GATEWAY tags (data not shown).
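
The mass-matching step described above (start at the sequenced N-terminus and add residues until the calculated mass matches the measured one) can be sketched as follows. This is an illustration under stated assumptions, not the authors' program: the function name `fragment_end`, the use of average residue masses, the 5 Da tolerance and the absence of any modifications or tags are all assumptions.

```python
# Average residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # one water per free polypeptide chain


def fragment_end(orf, start, observed_mass, tolerance=5.0):
    """Return the 1-based end position of the fragment beginning at `start`
    whose calculated mass matches `observed_mass`, or None if none matches.

    `tolerance` (Da) is a hypothetical instrument tolerance.
    """
    mass = WATER
    for pos in range(start, len(orf) + 1):
        mass += RESIDUE_MASS[orf[pos - 1]]
        if abs(mass - observed_mass) <= tolerance:
            return pos
        if mass > observed_mass + tolerance:
            break  # overshoot: no residue boundary fits the measured mass
    return None


if __name__ == "__main__":
    demo_orf = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequence for illustration
    demo_mass = WATER + sum(RESIDUE_MASS[a] for a in demo_orf[2:12])
    print(fragment_end(demo_orf, 3, demo_mass))
```

Because the lightest residue (glycine) weighs about 57 Da, a tolerance of a few daltons gives at most one matching end position for a given start.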

Bioinformatics tools

BLAST [21] was used for alignments between our selected sequences and PDB [12] sequences. InterPro/InterProScan [19,20,36] was used to identify domains/fragments of the ORF selected for generating a stable protein domain/fragment. Domain Linker Finder (DLF) [16,37] was used to find possible domain linker regions. SignalP [22,23,38] and TMHMM [24,39] were used to predict signal peptides and transmembrane (TM) regions, respectively. ExtractCDS, written in Perl and developed for this work, was used to extract the coding regions corresponding to the selected domains. BatchPrimer, a comprehensive primer design program also developed for this work, was used to carry out batch primer design for the selected sequences.

Primer design and the PCR protocol for HTP cloning

We designed a PCR strategy that uses two forward primers (F1, F2) and two reverse primers (R1, R2) (Figure 1), modified from the strategy described by Kagawa and colleagues [34]. Primer F1 contains part of the protease cleavage site followed by the gene-specific sequence of the 5'-terminus: CCACGCGGCAGC-5' gene-specific sequence. Primer R1 contains part of the attB2 site followed by the gene-specific sequence of the 3'-terminus: CAAGAAAGCTGGGTTA-3' gene-specific sequence. Primer F2 contains attB1 and the protease cleavage site: GGGGACAAGTTTGTACAAAAAAGCAGGCTTGGTGCCACGCGGCAGC, and R2 contains attB2 and the termination codon: GGGGACCACTTTGTACAAGAAAGCTGGGTTA. The gene-specific regions of F1 and R1 were designed by BatchPrimer, which adjusts the oligo length so that the two primers of a pair have similar melting temperatures (Tm). The final Tm calculation was based on the formula of Breslauer and colleagues [35], with the salt concentration set to 10 mM. The length of the gene-specific oligos was limited to between 20 and 30 bases, according to our previous experimental results. Different DNA polymerases and different protocols were investigated. After a number of tests, we selected AccuPrime™ Pfx (Invitrogen) as the DNA polymerase and devised a corresponding multi-step laddered PCR protocol, described in Figure 2. PCR was run for 34 cycles with all four primers, at ratios of F1:F2 = 1:10 and R1:R2 = 1:10 [34]. The amounts of oligos, template and polymerase followed the AccuPrime™ Pfx user manual.
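
The primer layout above can be illustrated with a short sketch. It is not BatchPrimer: the function names and the 60°C target are hypothetical, the coding sequence is assumed to be supplied without its stop codon, and the Tm estimate is the basic GC-content formula used as a stand-in for the Breslauer nearest-neighbor calculation named in the text.

```python
F1_ADAPTER = "CCACGCGGCAGC"        # part of the protease cleavage site
R1_ADAPTER = "CAAGAAAGCTGGGTTA"    # part of attB2 plus the termination codon
F2 = "GGGGACAAGTTTGTACAAAAAAGCAGGCTTGGTGCCACGCGGCAGC"
R2 = "GGGGACCACTTTGTACAAGAAAGCTGGGTTA"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rough_tm(oligo):
    """Stand-in Tm estimate (basic GC formula), NOT the Breslauer method."""
    gc = oligo.count("G") + oligo.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(oligo)


def pick_length(region, target_tm, min_len=20, max_len=30):
    """Choose the 20-30 base prefix whose rough Tm is closest to target_tm."""
    if len(region) < min_len:
        raise ValueError("coding region shorter than the minimum oligo length")
    lengths = range(min_len, min(max_len, len(region)) + 1)
    return min(lengths, key=lambda n: abs(rough_tm(region[:n]) - target_tm))


def design_f1_r1(cds, target_tm=60.0):
    """Return (F1, R1) for a coding sequence given 5'->3' without stop codon."""
    cds = cds.upper()
    antisense = revcomp(cds)
    f_specific = cds[:pick_length(cds, target_tm)]
    r_specific = antisense[:pick_length(antisense, target_tm)]
    return F1_ADAPTER + f_specific, R1_ADAPTER + r_specific


if __name__ == "__main__":
    demo = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGT"
    f1, r1 = design_f1_r1(demo)
    print(f1)
    print(r1)
```

The adapter of F1 is identical to the 3' end of F2, and the adapter of R1 to the 3' end of R2, which is what lets the outer primers extend the products of the inner primers in the same reaction.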
Figure 2

A multi-step laddered PCR protocol. With this protocol, template DNA was amplified for 34 cycles: 5 minutes at 95°C for initial denaturation, then cycles of 20 seconds at 94°C for denaturation, 30 seconds for annealing and 140 seconds at 68°C for extension, followed by 10 minutes at 68°C for final extension. The annealing temperature was variable: it started at a relatively high temperature (55°C) and decreased by 1–2 degrees at each step down to 46°C; it was then raised by 5 degrees and held at 51°C.


Gateway cloning and small-scale protein expression

After the batch PCR, 96-well E-Gels (Invitrogen) were used to check the PCR products. Entry clones were generated from the PCR products and the entry vector pDONR201 (Invitrogen) by the BP reaction. The BP reaction and the transformation of DH5α cells were performed according to the manufacturer's GATEWAY protocols (Invitrogen). Mini-preps were carried out with QIAGEN 96-well mini-prep kits. Expression vectors were prepared in 96-well plates from the selected entry clones and vector pET15g [4] via the LR reaction. Cells transformed with the expression vectors were plated, and single colonies were selected for mini-prep. All of the above procedures (Figure 3), except colony picking, were automated in our integrated robotic pipeline, operating mainly on a BiomekFX robot, as previously described [4]. For protein expression, the expression vectors were first transformed into E. coli BL21(DE3)AI, and single colonies were picked. After overnight growth at 37°C, the cultures were diluted 1:200 into 0.6 ml of medium containing 100 μg/ml ampicillin in two 96-well block assay plates. After 3–4 hours of growth, without monitoring the absorbance of the culture, protein expression was induced at 18°C and 37°C by adding IPTG to a final concentration of 1 mM. Expression was carried out for 3 hours at 37°C and for 20 hours at 18°C.

Cell lysis and Enzyme-linked Immunosorbent Assay (ELISA)

After protein expression, cells were spun down at 4000 rpm for 30 minutes, and the cell pellets were lysed by freezing overnight at -80°C and then thawing at room temperature for 15 minutes. Lysis was continued by adding 500 μl of native lysis buffer (50 mM NaH2PO4, 300 mM NaCl, 10 mM imidazole, and 1 mg/ml lysozyme, pH 8.0) and shaking for 30 minutes at 1000 rpm in Vortemp shakers. After lysis, the plates were spun at 4000 rpm for 30 minutes, and a Beckman Biomek FX robot was used to separate the supernatant from the pellet. The supernatant, which contained only soluble proteins, was used for solubility analysis of the recombinant proteins by a dynamic indirect enzyme-linked immunosorbent assay (ELISA) protocol. Indirect ELISAs were carried out on a Beckman/Sagian core system: an ORCA robotic arm (Beckman) for moving plates, a Biomek 2000 (Beckman) for liquid handling, a Bio-Tek plate washer (Bio-Tek Instruments) for washing plates, and a SpectraMax plate reader (Molecular Devices) for recording and analyzing results. A mouse anti-His tag antibody (Anti-Penta-His, QIAGEN) was used as the primary antibody at a dilution of 1:500, and a rabbit anti-mouse IgG Fc alkaline phosphatase conjugate (Pierce) was used as the secondary antibody, also at a dilution of 1:500. p-Nitrophenyl phosphate (ICN) was used for staining according to the manufacturer's instructions. Absorbance at 405 nm was read every 30 minutes for 6 hours, and the results were then compiled electronically and scored automatically with in-house software.
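
The in-house scoring software is not described, but the expressed/not-expressed rule given in the Figure 6 legend (a well counts as expressed if its OD at 405 nm is at least the lower of the two positive controls, A12 and B12) can be sketched as below. The function names and the OD readings are hypothetical, a single reading per well is assumed, and treating a protein as not soluble only when it fails at both temperatures is an assumption inferred from Table 3.

```python
def expressed_wells(od405, positive=("A12", "B12"), negative=("C12", "D12")):
    """Map each sample well to True if its OD405 reaches the lower positive
    control, False otherwise. Control wells are left out of the result."""
    threshold = min(od405[well] for well in positive)
    controls = set(positive) | set(negative)
    return {w: od >= threshold for w, od in od405.items() if w not in controls}


def not_soluble(expressed_18, expressed_37):
    """Wells scored as not expressed at both 18°C and 37°C (assumed rule)."""
    return sorted(
        w for w, ok in expressed_18.items()
        if not ok and not expressed_37.get(w, False)
    )


if __name__ == "__main__":
    # Hypothetical OD405 readings for illustration only.
    od_18 = {"A12": 1.2, "B12": 0.9, "C12": 0.05, "D12": 0.06,
             "A2": 1.0, "B10": 0.1, "G2": 1.5}
    od_37 = {"A12": 1.1, "B12": 1.0, "C12": 0.05, "D12": 0.04,
             "A2": 1.4, "B10": 0.2, "G2": 0.3}
    print(not_soluble(expressed_wells(od_18), expressed_wells(od_37)))
```

The graded levels (high, medium, low) reported in Table 3 are not reproduced here because their cut-offs are not given in the text.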

Large scale expression/purification of soluble proteins and thrombin cleavage of purified proteins

Based on the ELISA results, we performed large-scale expression of the putatively soluble proteins with the same protocols as described above, except that the culture volume was increased from 0.6 ml to 6 liters and cells were induced when the absorbance at 595 nm reached 0.6 to 0.8. After the appropriate incubation (3 hours at 37°C or 20 hours at 18°C), cells were harvested by centrifugation (7000 rpm for 12 minutes). Cell pellets were then re-suspended in an appropriate amount of binding buffer (for the Ni-His6 affinity column: 20 mM Tris, 500 mM NaCl, 5 mM imidazole, and 0.01% sodium azide, pH 7.9) and completely lysed by sonication. The lysate was centrifuged for 30 minutes at 17000 rpm, the pellet was removed, and the lysate was filtered through Whatman paper. The proteins were first purified by Ni-nitrilotriacetic acid agarose (Qiagen) affinity chromatography: the protein mixture was loaded onto the column and, after the column was washed, the proteins were eluted under native conditions (500 mM imidazole, 20 mM Tris, 500 mM NaCl, 0.01% sodium azide, pH 7.9). The eluted proteins were then concentrated and further purified by standard protocols with ion-exchange (HiTrap Q column, Amersham) and size exclusion chromatography (Superdex 75 or Superdex 200 column, Amersham). Purified proteins were finally treated with thrombin (Sigma). Before thrombin treatment, a small amount of each purified protein was used to optimize the thrombin concentration: at room temperature, the protein was digested for 16 hours at a series of thrombin concentrations (0.1, 0.5, 1, and 5 units per milligram of target protein), and the concentration giving the best result was chosen. If the digestion results were not good enough, the amount of thrombin was increased or decreased and the test was repeated. Once the thrombin concentration was decided, the purified protein was mixed with the appropriate amount of thrombin and dialyzed against low-salt buffer (20 mM Tris, 100 mM NaCl, pH 7.5) at 4°C for 16 hours.
The resulting proteins were checked by sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE) and used in crystallization trials.
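
The thrombin titration described above reduces to simple arithmetic: units of thrombin = ratio (units per mg) × mass of target protein (mg). A minimal sketch, with a hypothetical function name and a hypothetical protein amount:

```python
# Test ratios from the text, in units of thrombin per milligram of protein.
TEST_RATIOS_U_PER_MG = (0.1, 0.5, 1.0, 5.0)


def thrombin_units(protein_mg, ratios=TEST_RATIOS_U_PER_MG):
    """Return {ratio: units of thrombin} for a given mass of target protein."""
    if protein_mg <= 0:
        raise ValueError("protein mass must be positive")
    return {ratio: ratio * protein_mg for ratio in ratios}


if __name__ == "__main__":
    # Hypothetical pilot digest with 2 mg of protein per condition.
    for ratio, units in thrombin_units(2.0).items():
        print(f"{ratio} U/mg -> {units} U")
```

The same function can be reused for the preparative digest by passing the chosen ratio as a one-element tuple together with the full amount of purified protein.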

Authors' contributions

YC developed programs for primer design and sequence extraction (BatchPrimer and ExtractCDS), devised the multi-step laddered PCR protocol, developed the domain/fragment selection method, carried out most gene cloning and protein expression experiments and drafted the manuscript. SQ participated in gene cloning and protein expression experiments. CL participated in gene cloning and protein expression experiments and automated the HTP cloning pipeline. ML conceived of the study, and participated in its design and coordination and helped to write the manuscript. All authors read and approved the final manuscript.
References (10 of 32 shown)

1.  The Protein Data Bank.

Authors:  H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

2.  A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites.

Authors:  H Nielsen; J Engelbrecht; S Brunak; G von Heijne
Journal:  Int J Neural Syst       Date:  1997 Oct-Dec       Impact factor: 5.866

Review 3.  Machine learning approaches for the prediction of signal peptides and other protein sorting signals.

Authors:  H Nielsen; S Brunak; G von Heijne
Journal:  Protein Eng       Date:  1999-01

4.  Basic local alignment search tool.

Authors:  S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal:  J Mol Biol       Date:  1990-10-05       Impact factor: 5.469

5.  A hidden Markov model for predicting transmembrane helices in protein sequences.

Authors:  E L Sonnhammer; G von Heijne; A Krogh
Journal:  Proc Int Conf Intell Syst Mol Biol       Date:  1998

6.  Predicting DNA duplex stability from the base sequence.

Authors:  K J Breslauer; R Frank; H Blöcker; L A Marky
Journal:  Proc Natl Acad Sci U S A       Date:  1986-06       Impact factor: 11.205

7.  Automated protein sequence database classification. II. Delineation Of domain boundaries from sequence similarities.

Authors:  J Gracy; P Argos
Journal:  Bioinformatics       Date:  1998       Impact factor: 6.937

8.  Generation of the Brucella melitensis ORFeome version 1.1.

Authors:  Amélie Dricot; Jean-François Rual; Philippe Lamesch; Nicolas Bertin; Denis Dupuy; Tong Hao; Christophe Lambert; Régis Hallez; Jean-Marc Delroisse; Jean Vandenhaute; Ignacio Lopez-Goñi; Ignacio Moriyon; Juan M Garcia-Lobo; Félix J Sangari; Alastair P Macmillan; Sally J Cutler; Adrian M Whatmore; Stephanie Bozak; Reynaldo Sequerra; Lynn Doucette-Stamm; Marc Vidal; David E Hill; Jean-Jacques Letesson; Xavier De Bolle
Journal:  Genome Res       Date:  2004-10       Impact factor: 9.043

9.  Human ORFeome version 1.1: a platform for reverse proteomics.

Authors:  Jean-François Rual; Tomoko Hirozane-Kishikawa; Tong Hao; Nicolas Bertin; Siming Li; Amélie Dricot; Ning Li; Jennifer Rosenberg; Philippe Lamesch; Pierre-Olivier Vidalain; Tracey R Clingingsmith; James L Hartley; Dominic Esposito; David Cheo; Troy Moore; Blake Simmons; Reynaldo Sequerra; Stephanie Bosak; Lynn Doucette-Stamm; Christian Le Peuch; Jean Vandenhaute; Michael E Cusick; Joanna S Albala; David E Hill; Marc Vidal
Journal:  Genome Res       Date:  2004-10       Impact factor: 9.043

10.  C. elegans ORFeome version 3.1: increasing the coverage of ORFeome resources with improved gene predictions.

Authors:  Philippe Lamesch; Stuart Milstein; Tong Hao; Jennifer Rosenberg; Ning Li; Reynaldo Sequerra; Stephanie Bosak; Lynn Doucette-Stamm; Jean Vandenhaute; David E Hill; Marc Vidal
Journal:  Genome Res       Date:  2004-10       Impact factor: 9.043
