sgTarget: a target selection resource for structural genomics.

Ana P C Rodrigues, Barry J Grant, Roderick E Hubbard.

Abstract

sgTarget (http://www.ysbl.york.ac.uk/sgTarget) is a web-based resource to aid the selection and prioritization of candidate proteins for structure determination. The system annotates user-submitted gene or protein sequences, identifying sequence families with no homologues of known structure, and characterizing each protein according to a range of physicochemical properties that may affect its expression, solubility and likelihood of crystallizing. Summaries of these analyses are available for individual sequences, as well as whole datasets. This type of analysis enables structural biologists to iteratively select targets from their genomic sequences of interest according to their research needs. All sequence datasets submitted to sgTarget are available for users to select and rank using their choice of criteria. sgTarget was developed to support individual laboratories collaborating in structural and functional genomics projects and should be valuable to structural biologists wishing to employ the wealth of available genome sequences in their structural quests.

Year: 2006 PMID: 16844998 PMCID: PMC1538879 DOI: 10.1093/nar/gkl121

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The first step in any structure determination project is to select the appropriate molecule for study. Selection strategies vary according to the scientific context and aims of the project. In structural genomics, which aims to determine the structure of all important bio-molecules, the large number of potential candidates complicates the selection process. It is therefore important to identify the molecules for which a structure (normally of a protein) will provide the highest new information content and, where possible, to quantify how tractable each molecule is for structure determination (1,2). Evolutionary constraints can be used to identify proteins that may adopt similar conformations to known protein structures. For these proteins, modeling approaches may provide sufficient information to understand structure and mechanism. Certain protein characteristics can be inferred from sequence and used to identify proteins that may pose problems during the various stages of structure determination. For example, fibrous domains can frustrate single crystal formation protocols and can frequently be identified by examining the protein's amino acid sequence (e.g. certain coiled coils). Structural biology groups wishing to select and prioritize targets from raw sequence data may currently use genomic annotation servers, such as PEDANT (3) or 3D-Genomics (4). These automated services contain gene and protein annotations for a number of completed genomes. Although they detail annotations of relevance to the selection procedure, they offer no user-accessible mechanism for generating target lists. sgTarget was specifically designed to enable structural biologists to submit their sequences of interest and to select and rank targets according to their choice of criteria. A simple web interface can be used to generate and download target lists that may be iteratively refined by users.
The resource was developed to assist individual laboratories participating in structural and functional genomics consortia, as necessitated by our laboratory's involvement in the Structural Proteomics IN Europe (SPINE) consortium and the Plasmodium Functional Genomics Initiative.

THE sgTarget ANNOTATION PIPELINE

A sequence annotation pipeline forms the core of the resource. It determines and predicts properties and relationships that can be used in the selection of suitable targets. The pipeline consists of a set of bioinformatics methods that were selected and incorporated into the resource's framework, as follows:
(i) Methods to predict protein fold, function and prevalence. These help to identify targets such as proteins for which fold predictions cannot yet be established, those with unknown functions, or ORFan proteins.
(ii) Assessment of known protein expression and crystallization issues. Nucleotide sequence based calculations determine the encoding gene's GC content, codon usage and its compatibility with that of the host expression system (the Codon Adaptation Index). These metrics can highlight potential problems for protein expression. Similarly, sequence based prediction of protein instability, solubility and half-life can identify issues for high throughput structure determination.
(iii) Assessment of known protein structure issues. Protein sequence based calculations predict the locations of intrinsically disordered, fibrous or transmembrane regions. The presence of these features can pose challenges for structure determination.
The majority of protocols employed by the annotation pipeline use established bioinformatics methods and databases (listed in Table 1). A novel procedure for the identification of intrinsically disordered regions was developed (5) and is described briefly below.
In addition, tailored thresholds were established for GC content (between 26.9 and 66.8% for the expression host Escherichia coli), Codon Adaptation Index (above 0.084 for expression in E.coli, and above 0.357 for high levels of expression) and E-value cutoffs to assess the structural significance of BLAST alignments (two cutoffs are employed by the resource: 2.07 × 10^−11, a conservative threshold, and 2.15 × 10^−4, a 'natural' threshold with a false positive rate of 0.2%).
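The thresholds quoted above lend themselves to a small worked example. The following is a minimal sketch, not sgTarget's own code: the function names are hypothetical, the Codon Adaptation Index is assumed to have been computed already (for instance by CodonW), and treating an E-value equal to the cutoff as significant is an assumption.

```python
# Thresholds quoted in the text for the expression host E. coli.
GC_MIN, GC_MAX = 26.9, 66.8        # acceptable GC content, in percent
CAI_MIN, CAI_HIGH = 0.084, 0.357   # expression / high-level expression
EVALUE_CONSERVATIVE = 2.07e-11     # conservative BLAST E-value cutoff
EVALUE_NATURAL = 2.15e-4           # 'natural' cutoff (0.2% false positives)


def gc_percent(dna: str) -> float:
    """Return the GC content of a nucleotide sequence as a percentage."""
    seq = dna.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def gc_ok(dna: str) -> bool:
    """True if the GC content lies inside the quoted E. coli range."""
    return GC_MIN <= gc_percent(dna) <= GC_MAX


def cai_class(cai: float) -> str:
    """Classify a precomputed CAI value (labels are hypothetical)."""
    if cai > CAI_HIGH:
        return "high"
    if cai > CAI_MIN:
        return "expressible"
    return "poor"


def structural_hit(evalue: float, conservative: bool = True) -> bool:
    """True if a BLAST E-value passes the chosen cutoff.

    Assumption: an E-value equal to the cutoff counts as significant.
    """
    cutoff = EVALUE_CONSERVATIVE if conservative else EVALUE_NATURAL
    return evalue <= cutoff
```

For example, `structural_hit(1e-6)` fails the conservative cutoff but `structural_hit(1e-6, conservative=False)` passes the 'natural' one, which illustrates how the two cutoffs separate high-confidence from low-confidence structural annotations.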

Table 1

Software, databases and selected protocols employed in sgTarget's annotation pipeline

Software	Application
CodonW^a	Calculate the relative conformance of a gene to an organism's genome (the Codon Adaptation Index)
BLAST (18)	Perform local protein sequence similarity searches against PDB and NRDB sequences
InterProScan (19)	Run sequence comparison methods required to search the InterPro database (as well as NCOILS (20) to identify coiled-coil domains)
SEG (9)	Detect and isolate subsequences with high or low-complexity
TMHMM (21)	Predict the location and topology of protein transmembrane regions
Database	Description
PDB SEQRES (22)	Protein sequences derived from the SEQRES card of PDB files
InterPro (23)	Integrated collection of the protein domain family databases (Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs and PANTHER)
GO (24)	Function ontology database with mappings to InterPro
NRDB	Collection of protein sequence databases (PIR, SWISS-PROT, TrEMBL and PDB SEQRES)
Taxonomy^b	Taxonomical classification of organisms cross-referenced by NRDB
Protocol	Description & Application
Instability index (25)	The instability index is a length-scaled measure of the occurrence of all dipeptides in a protein sequence. Guruprasad and colleagues found a correlation between this measure and protein stability: in general, stable proteins have instability indices smaller than 40.
Estimate half-life using the N-end rule (26)	Estimates of in vivo half-life for proteolysis of proteins in prokaryotes can be made by the N-end rule. This considers the presence of a destabilizing N-terminal residue that provides an N-degron degradation signal.
Wilkinson–Harrison solubility index (27,28)	The revised Wilkinson–Harrison statistical solubility model depends on two parameters: the fraction of residues with a high index for forming turns and the approximate average charge of the protein in vivo. This model has been shown to be useful in the selection of proteins with high solubility.

^a CodonW.

^b Taxonomy.

Identification of intrinsically disordered regions

Intrinsically disordered domains can cause a multitude of adverse effects in structural determination studies, including purification difficulties due to hypersensitivity to protease digestion, missing electron density due to incoherent X-ray scattering, hindered crystallization, extreme broadening of side chain NMR peaks and lack of chemical shift dispersion of NMR backbone data. Some of these segments may become ordered upon interaction with binding partners to perform specific functions (6). Their structural characterization would, however, be difficult even if prior knowledge of the required cofactors was available. The annotation pipeline employs the charge-hydrophobicity phase-space boundary of Uversky et al. (7), complemented by the putative lower bound complexity threshold of Romero and colleagues (8), to predict regions of intrinsic disorder. The low-complexity detection software SEG isolates subsequences with high or low-complexity on the basis of information content (9). In sgTarget, SEG is employed to detect any subsequences of at least 45 residues with a complexity value lower than 2.90. Such regions are annotated as probable non-globular protein stretches. For the remaining subsequences the mean hydrophobicity [the sum of the normalized hydrophobicities from (10) divided by the number of residues] and the mean net charge at pH 7.0 are calculated, and used in Equation 1, to predict if a subsequence is likely to be intrinsically disordered. Uversky and colleagues found that disordered proteins have low overall hydrophobicity and high net charge, always falling below the boundary:

〈H〉 = (〈R〉 + 1.151) / 2.785    (1)

where 〈H〉 is the mean hydrophobicity and 〈R〉 is the mean net charge (7,11). The performance of sgTarget's disorder prediction method on the CASP5 disorder benchmark was evaluated (12).
sgTarget's disorder predictions for those targets that are least related to a protein with known structure achieved an accuracy of 0.77 (where accuracy is the arithmetic mean of sensitivity and specificity measured on a per residue basis), which compares favorably with previously reported methods. Hence, the method is suitable for analyzing datasets where there may be many new folds, such as the complete genomes that serve as input to the resource. In summary, the annotation methods employed by sgTarget allow the identification and prediction of a wide range of properties for each putative target. These enable users to filter and prioritize proteins and genes, generating lists of targets to suit diverse requirements.
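The charge-hydrophobicity test described in this section can be illustrated with a short sketch. It is not sgTarget's implementation: it assumes the published Uversky boundary 〈R〉 = 2.785〈H〉 − 1.151, the Kyte-Doolittle scale rescaled to the range 0 to 1, and a deliberately simple charge model (Lys and Arg count as +1, Asp and Glu as −1, His and the termini are ignored). The SEG complexity filter is not reproduced, and all function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(seq: str) -> float:
    """Mean Kyte-Doolittle hydropathy, rescaled so each residue is in 0..1."""
    seq = seq.upper()
    return sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in seq) / len(seq)


def mean_net_charge(seq: str) -> float:
    """Absolute mean net charge near pH 7 under the simple charge model.

    Assumption: K and R carry +1, D and E carry -1, everything else 0.
    """
    seq = seq.upper()
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return abs(positive - negative) / len(seq)


def boundary_hydrophobicity(mean_charge: float) -> float:
    """Mean hydrophobicity on the boundary for a given mean net charge."""
    return (mean_charge + 1.151) / 2.785


def is_disordered(seq: str) -> bool:
    """True if the sequence falls below the charge-hydrophobicity boundary."""
    return mean_hydrophobicity(seq) < boundary_hydrophobicity(mean_net_charge(seq))
```

A highly charged, hydrophilic stretch such as `"E" * 20` falls below the boundary, whereas a hydrophobic stretch such as `"L" * 20` does not. In practice the test would be applied to the subsequences that remain after the low-complexity filter, as described above.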

THE sgTarget SERVER

A web-based interface has been developed to interact with the sequence annotation pipeline. This allows users to analyze genomic sequences of interest by submitting them to the server, to interact with the resulting data by browsing or searching, and to select and prioritize targets for structural determination according to their choice of criteria. The interface is available online and its functionality is divided into three main pages: Load, View and Target.

Load

The Load page allows users to submit their sequences of interest through an anonymous interface. Requests are submitted to the annotation pipeline and processed sequentially. Annotations for an average bacterial chromosome (∼5 Mb or ∼4000 protein coding genes) take ∼24 h to complete. Users can choose to be notified of progress by e-mail on initiation and on completion of annotations. Depending on the level and nature of user requests, there may need to be some prioritization and arbitration on the order and choice of which organisms or datasets are annotated.

View

The View page allows users to analyze the sequence annotations performed by the resource. Users can browse through the annotations for a dataset using the Browse function. Here detailed annotations are available for individual proteins, and global synopses are available for the dataset's characteristics. Browsing the data by protein enables users to investigate the results of all the calculations obtained through the annotation pipeline for a particular gene/protein sequence. This includes gene information, such as GC content and codon usage, protein information, such as function, structure and prevalence predictions, and information on the suitability of the target for structural studies, such as the number of transmembrane, disordered and coiled-coil regions, and the protein's physicochemical properties. Browsing the data by characteristic enables users to investigate the results of a particular set of calculations for that dataset. This includes global statistics for gene expression predictions, structural and functional annotations, prevalence assignments, transmembrane and non-globular regions predictions, as well as physicochemical properties. Within the View page, users can also search each subset using the Search function. It allows users to find proteins using the resource's own identifier, as well as other identifiers (GenBank accession no.) and names (sequencing center naming), as provided by the sequence input files.

Target

The Target page enables users to select and prioritize targets. The Select function is used to specify the datasets to target, which gene and protein properties the targets should possess, and what parameters and thresholds should be employed in the selection (Figure 1). All annotations established through the annotation pipeline can be employed as selection parameters. Upon selecting targets, users are presented with the Rank function, which enables them to perform target prioritization (Figure 2). This function also allows users to choose the format and layout of the target list, which is finally presented to them (Figure 3).
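The select-then-rank workflow can be sketched in a few lines. This is an illustration only, not the server's code: the `Target` record and its field names are hypothetical, and the default criteria are modeled on those listed in the Figure 1 caption, using the GC and CAI thresholds and the instability index cutoff of 40 quoted elsewhere in the text. Reading "optimal CAI" as "above 0.084" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Target:
    """Hypothetical record holding a few pipeline annotations."""
    identifier: str
    gc_percent: float
    cai: float
    has_structural_homologue: bool
    instability_index: float
    half_life_hours: float
    tm_regions: int
    fibrous_regions: int
    disordered_regions: int
    nrdb_coverage: float  # fraction of the sequence aligned to NRDB proteins


def passes_default_criteria(t: Target) -> bool:
    """Approximation of the default selection criteria (see lead-in)."""
    return (
        26.9 <= t.gc_percent <= 66.8
        and t.cai > 0.084
        and not t.has_structural_homologue
        and t.instability_index < 40.0
        and t.half_life_hours >= 2.0
        and t.tm_regions <= 1
        and t.fibrous_regions == 0
        and t.disordered_regions == 0
    )


def select(targets: Iterable[Target],
           predicate: Callable[[Target], bool] = passes_default_criteria
           ) -> List[Target]:
    """Keep only the targets that satisfy the selection predicate."""
    return [t for t in targets if predicate(t)]


def rank(targets: Iterable[Target],
         key: Callable[[Target], float],
         descending: bool = True) -> List[Target]:
    """Order the selected targets by one annotation."""
    return sorted(targets, key=key, reverse=descending)
```

Ranking by decreasing NRDB coverage, as in the Figure 2 example, would then be `rank(select(targets), key=lambda t: t.nrdb_coverage)`. Keeping selection and ranking as separate steps mirrors the Select and Rank functions and makes it easy to refine a target list iteratively.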

Figure 1

Target page with Select function activated. The menu area (on the left) allows users to choose one or more sequence datasets to target. The work area (on the right) allows users to specify selection criteria. In this example, the Mycoplasma genitalium genome has been chosen for targeting. The selection criteria specify that genes must have a GC content and CAI that is optimal for E.coli, and proteins have no homologues with known structure, are likely to be stable, viable in E.coli for at least 2 h, have at most one transmembrane region, and no fibrous or disordered regions (sgTarget's default selection criteria). When users click the OK button they are presented with the Rank function, and asked to choose how the target list should be prioritized and displayed (shown in Figure 2).

Figure 2

Target page with Rank function activated. The menu area (on the left) shows a summary of the results returned by the Select function. The work area (on the right) allows users to specify which data to display for the selected targets, and how to rank those targets by specifying the priority of each annotation. Users can choose to view the prioritized target list as a Web page (by clicking the HTML button) or, alternatively, as a tabbed text file (by clicking the TEXT button). In this case, 49 targets were selected with the criteria specified in Figure 1. The target list is to be ranked with decreasing coverage by NRDB database (i.e. proteins with more of their length annotated as similar to a protein in the NRDB database have higher priority) and a number of protein physicochemical properties are to be displayed along with the default attributes (off the screen in this screenshot) (see Figure 3 for resulting page).

Figure 3

Target page showing a target list. The selected targets are ranked according to the order and priority specified for the different annotations, and a table of prioritized targets is built using the annotations that were chosen for display. In this case, a list of 49 targets (selected from M.genitalium's genome with the criteria specified in Figure 1) was ranked by decreasing coverage by NRDB database proteins, and a table constructed showing the target's identifier (in sgTarget), accession number, name, molecular weight, length, GRAVY score, isoelectric point, coverage by NRDB database proteins (including the span of the alignments on the target and the top taxonomic group which encompasses all reported alignments) and function annotation (the top InterPro hit and its GO high-level molecular function) (as specified in Figure 2).

APPLICATION

sgTarget has underpinned the selection of targets for our laboratory's collaboration in the Plasmodium Functional Genomics Initiative. The resource was employed to annotate the genome of Plasmodium falciparum, the organism that causes the most fatal form of human malaria (Figure 4). This enabled the generation of a target list by refining the selection choices to consider parameters selected by researchers in the group. The initial list of 73 targets consists of malaria proteins encoded by single exon genes with GC contents higher than 30%, no transmembrane regions and no long non-globular hydrophilic regions. GC content and intron number are the most selective of the parameters, together reducing the number of possible targets by 98%. These selection criteria were chosen to identify proteins likely to express in E.coli, and initial results obtained by the group indicate that the target list has been successful on those terms (13). Thus far, the group has initiated work on 10 of these targets, successfully cloned and expressed 8, and purified 6, of which 1 is in crystallization trials [and has also been shown to be crucial for the parasite's invasion of human red blood cells (14)] and 3 have already yielded high-resolution structures (15,16; Boucher, I., Brzozowski, A.M., Brannigan, J.A., Schnick, C., Smith, D., Kyes, S. and Wilkinson, A.J., manuscript in preparation).

Figure 4

P.falciparum annotation wheel, with an emphasis on structural annotation. Annotations are displayed anti-clockwise as follows: a total of 1055 proteins have structural annotations, 691 high-confidence and 364 low-confidence (PDB SEQRES, release 05/2002); of the remaining proteins, 3714 are likely to be intractable: 1475 have transmembrane regions, a further 2131 have disordered regions and the other 108 have fibrous regions; for the remainder of the proteome, 187 proteins have function annotations, although only 97 of these are classified by GO; most other proteins are found in other organisms (295), except for 16 ORFan proteins.

In addition, sgTarget has been employed to select a set of Bacillus anthracis target proteins for the SPINE consortium. Here, the resource was used in tandem with the bioinformatics tools available at the Oxford Protein Production Facility to select a set of proteins of desirable molecular weight (20 to 55 kDa) that are likely to be soluble (insolubility probability smaller than 0.7) (17). We encourage structural biologists to submit sequence datasets to sgTarget and to contact us with suggestions on software and databases for the annotation pipeline, the annotation views provided by sgTarget and the functionality of the Target page.

25 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Why are "natively unfolded" proteins unstructured under physiologic conditions?

Authors: V N Uversky; J R Gillespie; A L Fink
Journal: Proteins Date: 2000-11-15

3. Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm.

Authors: P E Wright; H J Dyson
Journal: J Mol Biol Date: 1999-10-22 Impact factor: 5.469

4. Sequence complexity of disordered protein.

Authors: P Romero; Z Obradovic; X Li; E C Garner; C J Brown; A K Dunker
Journal: Proteins Date: 2001-01-01

5. The InterPro Database, 2003 brings increased coverage and new features.

Authors: Nicola J Mulder; Rolf Apweiler; Teresa K Attwood; Amos Bairoch; Daniel Barrell; Alex Bateman; David Binns; Margaret Biswas; Paul Bradley; Peer Bork; Phillip Bucher; Richard R Copley; Emmanuel Courcelle; Ujjwal Das; Richard Durbin; Laurent Falquet; Wolfgang Fleischmann; Sam Griffiths-Jones; Daniel Haft; Nicola Harte; Nicolas Hulo; Daniel Kahn; Alexander Kanapin; Maria Krestyaninova; Rodrigo Lopez; Ivica Letunic; David Lonsdale; Ville Silventoinen; Sandra E Orchard; Marco Pagni; David Peyruc; Chris P Ponting; Jeremy D Selengut; Florence Servant; Christian J A Sigrist; Robert Vaughan; Evgueni M Zdobnov
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

6. The PEDANT genome database.

Authors: Dmitrij Frishman; Martin Mokrejs; Denis Kosykh; Gabi Kastenmüller; Grigory Kolesov; Igor Zubrzycki; Christian Gruber; Birgitta Geier; Andreas Kaps; Kaj Albermann; Andreas Volz; Christian Wagner; Matthias Fellenberg; Klaus Heumann; Hans-Werner Mewes
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

7. Predicting the solubility of recombinant proteins in Escherichia coli.

Authors: D L Wilkinson; R G Harrison
Journal: Biotechnology (N Y) Date: 1991-05

8. A hidden Markov model for predicting transmembrane helices in protein sequences.

Authors: E L Sonnhammer; G von Heijne; A Krogh
Journal: Proc Int Conf Intell Syst Mol Biol Date: 1998

9. A simple method for displaying the hydropathic character of a protein.

Authors: J Kyte; R F Doolittle
Journal: J Mol Biol Date: 1982-05-05 Impact factor: 5.469

10. dUTPase as a platform for antimalarial drug design: structural basis for the selectivity of a class of nucleoside inhibitors.

Authors: Jean L Whittingham; Isabel Leal; Corinne Nguyen; Ganasan Kasinathan; Emma Bell; Andrew F Jones; Colin Berry; Agustin Benito; Johan P Turkenburg; Eleanor J Dodson; Luis M Ruiz Perez; Anthony J Wilkinson; Nils Gunnar Johansson; Reto Brun; Ian H Gilbert; Dolores Gonzalez Pacanowska; Keith S Wilson
Journal: Structure Date: 2005-02 Impact factor: 5.006
