SAHG, a comprehensive database of predicted structures of all human proteins.

Chie Motono, Junichi Nakata, Ryotaro Koike, Kana Shimizu, Matsuyuki Shirota, Takayuki Amemiya, Kentaro Tomii, Nozomi Nagano, Naofumi Sakaya, Kiyotaka Misoo, Miwa Sato, Akinori Kidera, Hidekazu Hiroaki, Tsuyoshi Shirai, Kengo Kinoshita, Tamotsu Noguchi, Motonori Ota.

Abstract

Most proteins from higher organisms are known to be multi-domain proteins and contain substantial numbers of intrinsically disordered (ID) regions. To analyse such protein sequences, those from human for instance, we developed a special protein-structure-prediction pipeline and accumulated the products in the Structure Atlas of Human Genome (SAHG) database at http://bird.cbrc.jp/sahg. With the pipeline, human proteins were examined by local alignment methods (BLAST, PSI-BLAST and Smith-Waterman profile-profile alignment), global-local alignment methods (FORTE) and prediction tools for ID regions (POODLE-S) and homology modeling (MODELLER). Conformational changes of protein models upon ligand-binding were predicted by simultaneous modeling using templates of apo and holo forms. When there were no suitable templates for holo forms and the apo models were accurate, we prepared holo models using prediction methods for ligand-binding (eF-seek) and conformational change (the elastic network model and the linear response theory). Models are displayed as animated images. As of July 2010, SAHG contains 42,581 protein-domain models in approximately 24,900 unique human protein sequences from the RefSeq database. Annotation of models with functional information and links to other databases such as EzCatDB, InterPro or HPRD are also provided to facilitate understanding the protein structure-function relationships.

Year: 2010 PMID: 21051360 PMCID: PMC3013665 DOI: 10.1093/nar/gkq1057

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Genome sequencing projects are now producing complete genome sequences at an extremely high rate (1,2), and with the rise of next-generation sequencers (3–5) this trend will almost certainly continue. Consequently, the number of known protein sequences (6) is growing more rapidly than the number of experimentally determined protein structures (7). To make full use of genome sequences, however, the proteins they encode must be analysed, and for this purpose protein three-dimensional (3D) structures provide much information (8,9). Computational methods for protein 3D structure prediction are expected to bridge the gap between the number of known protein sequences and the number of known protein structures. According to assessments of the accuracy of those methods, e.g. recent Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiments (10,11), template-based protein structure prediction often produces 3D models accurate enough for functional annotation, modification of protein function or even structure-based drug design (12,13). In addition, in the CASP7 and CASP8 experiments, fully automated structure prediction methods reached a level comparable to the best performance of methods with human intervention (14). In the CASP experiments, the target sequences are those whose 3D structures are about to be determined experimentally. Such proteins are therefore expected to consist of a single domain or a few domains and to be suitable for experimental structure determination, and their sequences are sometimes truncated from the full-length forms. In contrast, most protein sequences encoded in the genomes of higher organisms are long, are likely to be multi-domain proteins (15) and contain a significant fraction of intrinsically disordered (ID) regions (16–19).
Clearly, these proteins are unsuitable for experimental structure determination in their full-length form and are distinct from the target protein sequences of CASP. To analyse such proteins, we have developed a special protein-structure-prediction pipeline by integrating and arranging various computational tools, either developed by us or widely used as global standards. This pipeline was applied to all proteins encoded in the human genome. The resulting 3D models, as well as other annotations of protein function, were accumulated in the Structure Atlas of Human Genome (SAHG) database and are presented through the web interface at http://bird.cbrc.jp/sahg. There are other databases of protein structure models, e.g. SWISS-MODEL Repository (20) and ModBase (21). Both contain annotated protein structure models generated by their own automated modeling pipelines, and both allow users to build models on demand. Compared with them, the SAHG database is distinct mainly in the following points: (i) the 3D models in SAHG were generated by an original pipeline designed for multi-domain proteins with substantial ID regions; (ii) conformational changes of proteins upon ligand-binding are predicted by simultaneous modeling using templates of the ligand-bound state (holo form) and the unbound state (apo form) and are displayed as animated images; and (iii) functional annotations for protein interactions, e.g. ligand-binding and protein–protein interactions, are available. All these features are suited to analysing eukaryotic proteins toward a deep understanding of their functions and interactions.

PREDICTION SCHEME AND CONTENTS

Overview

Schematically, two types of prediction systems were used to analyse protein sequences [RefSeq sequences (22)] automatically. One is the ‘Structure prediction pipeline’ (right pink regions in Figure 1), in which several homology search and protein structure prediction tools, conducting sequence–sequence, sequence–profile and profile–profile alignments, are combined sequentially; it processes protein sequences, assigns 3D templates to them and finally produces 3D models. Where templates were available, 3D models of both apo and holo forms were generated. The other is the set of ‘Other structure and function predictors’ (bottom light-blue regions in Figure 1), an ensemble of independent prediction tools that analyse protein sequences. All the results from these systems were accumulated in SAHG in XML format.

Figure 1.

SAHG prediction systems. ‘Structure prediction pipeline’ and ‘Other structure and function predictors’ are shown in the right pink regions and bottom light-blue regions, respectively. The center panel illustrates each procedure in the flow of the structure prediction pipeline, showing how the results of the systems are integrated. SWPPA: Smith–Waterman profile–profile alignment method; ID: intrinsically disordered; ENM: elastic network model.

Structure prediction pipeline

Construction of 3D models

Protein structure prediction consists of the following procedures: template search and selection, alignment of the target sequence to the template, building of 3D models and evaluation of model quality. Template searches and their assignment to a target protein follow a ‘step-wise-multi-methods’ approach. In the first step, a BLAST (23) search against all the latest Protein Data Bank (PDB) (7) and Structural Classification of Proteins (SCOP) (24,25) sequences was performed with an E-value cut-off of 10⁻⁵. We selected only templates for which at least 90% of the template sequence could be aligned with the target, to ensure that the 3D models corresponded to stable domains or proteins. The resulting target sequence–template alignments were ranked by their E-values. The best combination of templates for each domain was determined using an original algorithm that maximizes the coverage of the target sequence (label I in Figure 1). In the second step, a PSI-BLAST (23) search with the same parameters was conducted for the remaining regions of the target sequence, to which no models had been assigned, and the best templates were assigned to the target sequence (II in Figure 1). Protein sequence profiles were prepared using the latest NCBI-nr database. In the third step, a Smith–Waterman profile–profile alignment method (SWPPA) (26) was applied to the remaining regions against a restricted set of templates (SCOP and PDB subsets with less than 40% sequence identity) with a cut-off of Z-score > 10, a threshold comparable to E-value < 10⁻⁵ in PSI-BLAST (III in Figure 1). Finally, a FORTE (27) search, a profile–profile comparison method, was performed for the remaining regions with a strict cut-off of Z-score > 20 to detect distantly related templates (V in Figure 1). FORTE is based on the global–local alignment method and was adjusted to perform best (28) when the target proteins were almost the same length as the PDB entries (around 400 aa) (29).
However, more than half of human proteins (53%) are longer than 400 amino acids, and even the remaining regions are sometimes over 2000 amino acids long. Thus, prior to the FORTE search, potential domains were carved out from the remaining regions using an algorithm based on the prediction of ID regions (IV in Figure 1) and fed into FORTE (see ‘Prediction of potential domains’ section for details). Once the target sequence–template alignments were obtained, all templates were checked against our ‘apo and holo form table’ (see ‘Apo and holo form table’ section in Supplementary data). For a template in the apo form, the corresponding template (>90% sequence identity) in the holo form was selected from the table, and vice versa. Alignments to the target sequence were prepared for both templates (VI in Figure 1). In the model building and quality assessment step, 10 models were constructed using the MODELLER (30) software. The quality of the models was evaluated using the Stability score (31), and the best 3D model for each alignment was chosen (VII in Figure 1). As of July 2010, 24 878 RefSeq sequences [(22), 14 012 591 residues] encoded in the human genome had been processed by the pipeline. In total, 42 581 structure models were constructed, of which 18 228, 14 577, 9163 and 613 were based on templates detected by BLAST, PSI-BLAST, SWPPA and FORTE, respectively. For 4083 models (9% of all models), both the apo and holo forms were assigned. In total, 35 275 residues were predicted to form long ID regions and were removed from the target sequences in advance of the FORTE search, and 295 309 residues were eliminated because they were fragmented into small pieces (<26 residues). Multiple models were generated for 9057 RefSeq sequences, while only one model was generated for 12 310 RefSeq sequences; 3511 RefSeq sequences remain without any predicted model.
Note that one model does not necessarily correspond to one domain (it sometimes corresponds to a whole protein chain); nevertheless, more than one-third of human proteins were estimated to be multi-domain proteins. In some cases, we assessed the predictions by comparing the models with recently determined protein structures. Even when the sequence identities of the alignments were quite low (<20%), more than half of the predictions detected the correct folds (Supplementary Table S1), indicating that our prediction pipeline worked well.
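
The step-wise logic described above, in which each search method is applied only to the regions left unassigned by the previous one, can be sketched as follows. This is a minimal illustration under stated assumptions, not the SAHG implementation: the search functions are hypothetical stand-ins for BLAST, PSI-BLAST, SWPPA and FORTE, residue ranges are half-open, and dropping short leftovers after every step is an assumption (only the 26-residue limit comes from the text).

```python
from typing import Callable, List, Tuple

Region = Tuple[int, int]  # half-open residue range [start, end)
MIN_FRAGMENT = 26  # leftover pieces shorter than this are dropped


def subtract(regions: List[Region], covered: List[Region]) -> List[Region]:
    """Return the parts of `regions` that no interval in `covered` overlaps."""
    leftover = []
    for start, end in regions:
        cursor = start
        for c_start, c_end in sorted(covered):
            if c_end <= cursor or c_start >= end:
                continue  # no overlap with what is left of this region
            if c_start > cursor:
                leftover.append((cursor, c_start))
            cursor = max(cursor, c_end)
        if cursor < end:
            leftover.append((cursor, end))
    return leftover


def run_cascade(length: int,
                searches: List[Tuple[str, Callable[[Region], List[Region]]]]):
    """Apply each search only to regions left unassigned by earlier searches."""
    remaining: List[Region] = [(0, length)]
    assigned: List[Tuple[str, Region]] = []
    for name, search in searches:
        hits = [hit for region in remaining for hit in search(region)]
        assigned.extend((name, hit) for hit in hits)
        remaining = [r for r in subtract(remaining, hits)
                     if r[1] - r[0] >= MIN_FRAGMENT]
    return assigned, remaining


# Hypothetical toy searches: each returns modeled ranges inside `region`.
def toy_blast(region: Region) -> List[Region]:
    return [(0, 100)] if region[0] <= 0 and region[1] >= 100 else []


def toy_psiblast(region: Region) -> List[Region]:
    return [(120, 200)] if region[0] <= 120 and region[1] >= 200 else []


assigned, remaining = run_cascade(
    300, [("BLAST", toy_blast), ("PSI-BLAST", toy_psiblast)])
```

In this toy run the 20-residue gap between the two hits is discarded as a fragment, and only residues 200 to 299 would be passed on to a later, more sensitive search.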

Treatment of multi-domain proteins

Many human proteins are composed of multiple domains and contain a significant fraction of ID regions, as described above. These factors often prevent the prediction of protein structures in their full-length forms. As a result, SAHG principally presents a protein structure as an array of domains. However, when multi-domain structures are available among the templates, the prediction pipeline implicitly prioritizes them to take advantage of the relative domain orientations. The pool of templates consists of SCOP (24,25) domains and whole PDB (7) structures, some of which are not deposited in SCOP. At the template assignment steps (I, II, III and V in Figure 1), a set of templates was chosen to maximize the length of the modeled regions. This approach is effective in selecting PDB structures that span multiple domains as templates.
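
The text does not specify the original algorithm used to choose the template set. As a stand-in, the sketch below uses standard weighted interval scheduling to pick non-overlapping hits with maximal total aligned length; the `Hit` tuple layout and the template names are hypothetical. It shows why a single template spanning two domains can win over two single-domain templates that overlap each other.

```python
from bisect import bisect_right
from typing import List, Tuple

Hit = Tuple[int, int, str]  # (start, end, template_id), half-open [start, end)


def max_coverage(hits: List[Hit]) -> List[Hit]:
    """Pick non-overlapping hits whose total aligned length is maximal."""
    hits = sorted(hits, key=lambda h: h[1])
    ends = [h[1] for h in hits]
    best = [0] * (len(hits) + 1)  # best[i]: max covered length using first i hits
    take = [False] * len(hits)
    prev = [0] * len(hits)
    for i, (start, end, _) in enumerate(hits):
        # Number of earlier hits that end at or before this hit starts.
        prev[i] = bisect_right(ends, start, 0, i)
        with_hit = best[prev[i]] + (end - start)
        if with_hit > best[i]:
            best[i + 1] = with_hit
            take[i] = True
        else:
            best[i + 1] = best[i]
    # Walk back through the table to recover the chosen hits.
    chosen = []
    i = len(hits)
    while i > 0:
        if take[i - 1]:
            chosen.append(hits[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return chosen[::-1]


# Two overlapping single-domain hits versus one multi-domain hit.
selection = max_coverage([(0, 150, "d1"), (140, 300, "d2"), (0, 290, "multi")])
```

Here `selection` contains only the multi-domain hit, because the two single-domain hits overlap and cannot both be kept.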

Prediction of potential domains

ID regions were predicted using the POODLE-S (18) software, which calculates for each residue the probability of being in an ID region (XIII in Figure 1). As ID regions are considered to play fundamental roles in biological activities (17), their detection is important in its own right. On the other hand, it is necessary to remove long ID regions from the target sequences and assign potential domain regions to ensure better performance in structure prediction (FORTE search, V in Figure 1). For this purpose, we evaluated an existing method for predicting domain boundaries [DomCut (32)] and found that it tended to overcut potential domain regions into segments. The same tendency has been reported for other methods (33–35). We considered this over-prediction disadvantageous for preparing the input sequences for FORTE and developed a new method whose prediction is more ‘moderate’ (containing fewer false positives but more false negatives), based on the results of ID region prediction (IV in Figure 1), since ID regions act as linkers between structural domains (36). First, the POODLE-S results for a target sequence were converted into a binary sequence in which 0 (P < 0.5) and 1 (P ≥ 0.5) represent residues in structured regions and residues in ID regions, respectively. Next, to detect regions in which 0s are continuously abundant, we employed a simple two-state hidden Markov model. In this model, one state, ‘a mostly structured region’ (STR), emits 0 more frequently than 1, and the other state, ‘a mostly ID region’ (IDR), emits 1 more frequently than 0. The transition probability between STR and IDR and all the emission probabilities were empirically adjusted to eliminate over-prediction by referring to known domain data in PDB. Finally, the STR regions were estimated from the input binary sequence by calculating the Viterbi path.
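
The two-state model and Viterbi decoding described above can be sketched as follows. The probabilities `p_switch` and `p_emit_major` are hypothetical placeholders, since the empirically adjusted values are not given in the text, and equal initial state probabilities are also an assumption.

```python
import math
from typing import List, Sequence, Tuple

STATES = ("STR", "IDR")


def binarize(probabilities: Sequence[float], threshold: float = 0.5) -> List[int]:
    """0 for structured residues (P < threshold), 1 for ID residues."""
    return [0 if p < threshold else 1 for p in probabilities]


def viterbi_domains(binary: Sequence[int], p_switch: float = 0.02,
                    p_emit_major: float = 0.8) -> List[Tuple[int, int]]:
    """Return half-open [start, end) segments decoded as STR."""
    if not binary:
        return []
    log = math.log
    trans = {(a, b): log(1 - p_switch if a == b else p_switch)
             for a in STATES for b in STATES}
    # STR mostly emits 0; IDR mostly emits 1.
    emit = {"STR": {0: log(p_emit_major), 1: log(1 - p_emit_major)},
            "IDR": {0: log(1 - p_emit_major), 1: log(p_emit_major)}}
    score = {s: log(0.5) + emit[s][binary[0]] for s in STATES}
    back = []
    for symbol in binary[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda q: score[q] + trans[(q, s)])
            new_score[s] = score[best_prev] + trans[(best_prev, s)] + emit[s][symbol]
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    # Trace the most probable state path backwards.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    # Collapse consecutive STR residues into segments.
    segments, start = [], None
    for i, s in enumerate(path):
        if s == "STR" and start is None:
            start = i
        elif s != "STR" and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(path)))
    return segments


# Two structured blocks separated by a 30-residue ID linker, with isolated noise.
toy = [0] * 50 + [1] * 30 + [0] * 40
toy[20] = 1  # single ID call inside a structured block
toy[65] = 0  # single structured call inside the linker
domains = viterbi_domains(toy)
```

Because switching states is penalized more heavily than a single mismatched emission, the isolated calls at positions 20 and 65 do not split the segments, which is the ‘moderate’ behavior the text aims for.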

Prediction of conformational change upon ligand binding

When templates for both the ligand-bound state (holo form) and the unbound state (apo form) were detected using the ‘apo and holo form table’, two types of models were constructed and the structural changes upon ligand-binding were visualized by means of a morphing technique (the MORPH2 program in Martz-Authored PDB Tools; see http://www.umass.edu/microbio/rasmol/pdbtools.htm) (X in Figure 1). The animation of the conformational change provides significant information on protein function when it is shown together with functional residues and ligands. When only a template for the apo form was available and, accordingly, only the apo model was constructed, its putative ligand and binding sites were predicted by the eF-seek software (37) (VIII in Figure 1). eF-seek finds potential ligand-binding sites in the model of the apo form if similar structures are deposited in eF-site, the database of representative ligand-binding sites (38). eF-seek employs a clique search algorithm. As this method is sensitive to the input 3D coordinates, its application was limited to cases in which highly accurate structure models were available, i.e. the templates were detected by BLAST search with more than 90% sequence identity to the target sequences. The structural changes upon the predicted ligand-binding were then deduced using the elastic network model (39) and linear response theory to construct a model of the holo form (40) (IX in Figure 1). This approach and its presentation are among the key features of the SAHG database. Animated views of the conformational change of the domains upon ligand-binding can provide deep insight into the protein structure–function relationship (X in Figure 1). As of July 2010, conformational changes upon ligand-binding had been predicted for 4083 modeled domains among the 42 581 3D models.
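
For orientation, the combination of an elastic network model with linear response theory is commonly written as below. This is the textbook form, given as a sketch rather than the exact SAHG formulation: the spring constant C, the cut-off distance R_c and the way the ligand forces f_j are modeled are assumptions not specified in this section.

```latex
% Elastic network potential: harmonic springs between atom pairs closer than R_c
% in the reference (apo) structure; d_ij^0 is the reference distance.
V = \frac{C}{2} \sum_{\substack{i<j \\ d_{ij}^{0} < R_c}} \left( d_{ij} - d_{ij}^{0} \right)^{2}

% Harmonic approximation: equilibrium covariance from the pseudo-inverse H^+ of
% the Hessian H of V.
\left\langle \Delta\mathbf{r}_i \, \Delta\mathbf{r}_j^{\mathsf{T}} \right\rangle_0 = k_B T \left( H^{+} \right)_{ij}

% Linear response: displacement of atom i under forces f_j exerted by the ligand.
\Delta\mathbf{r}_i \simeq \frac{1}{k_B T} \sum_j \left\langle \Delta\mathbf{r}_i \, \Delta\mathbf{r}_j^{\mathsf{T}} \right\rangle_0 \mathbf{f}_j = \sum_j \left( H^{+} \right)_{ij} \mathbf{f}_j
```

In this form the temperature cancels, so the predicted apo-to-holo displacement depends only on the network geometry and on the forces assumed at the predicted binding site.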

Other structure and function predictors

Prediction of protein complex structure

In total, 33 687 protein complex structures were gathered from the PQS database (41). If all the subunits of two complexes could be paired with more than 95% sequence identity, the complexes were clustered together by single-linkage clustering. The complex structure with the highest resolution was selected from each cluster, giving a non-redundant set of 12 730 template complexes. If a target sequence was related to a given subunit of a template complex with >80% sequence identity by BLAST search and all the other subunits were also related to some target sequence, the complex model was constructed by MODELLER. In total, 8667 complex models were prepared for 3650 target sequences (XI in Figure 1).
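
The redundancy-removal step above amounts to finding connected components over pairs of similar complexes and keeping one representative per component. A minimal sketch with a union-find structure follows; the complex identifiers, the precomputed list of linked pairs and the resolution table are hypothetical inputs, and ‘highest resolution’ is taken to mean the smallest value in Å.

```python
from typing import Dict, List, Tuple


def single_linkage(items: List[str],
                   linked_pairs: List[Tuple[str, str]]) -> List[List[str]]:
    """Group items so that any chain of linked pairs ends up in one cluster."""
    parent = {x: x for x in items}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters: Dict[str, List[str]] = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values())


def pick_representatives(clusters: List[List[str]],
                         resolution: Dict[str, float]) -> List[str]:
    """Keep the best-resolved complex (smallest value in angstroms) per cluster."""
    return [min(cluster, key=lambda x: resolution[x]) for cluster in clusters]


# Hypothetical complexes: c1~c2 and c2~c3 exceed the identity threshold.
complexes = ["c1", "c2", "c3", "c4"]
pairs = [("c1", "c2"), ("c2", "c3")]
resolution = {"c1": 2.5, "c2": 1.8, "c3": 3.0, "c4": 2.0}
clusters = single_linkage(complexes, pairs)
representatives = pick_representatives(clusters, resolution)
```

Note that c1 and c3 land in the same cluster even though they were never compared directly, which is the defining property of single linkage.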

Ligand binding information

The ligands and their binding sites were retrieved from the constructed models. The ligands were mainly small molecules, such as peptides, nucleotides and metal ions; trivial chemicals originating from buffers or precipitants were excluded. Binding sites were defined as residues lying within 5 Å of any ligand atom.
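
The 5 Å criterion can be sketched as a simple distance check. The input layout (residue number paired with an atom coordinate) is hypothetical, and treating a residue as a binding-site residue when any one of its atoms lies within the cut-off, with the boundary included, is an assumption about how the distance was measured.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def binding_site_residues(protein_atoms: Sequence[Tuple[int, Coord]],
                          ligand_atoms: Sequence[Coord],
                          cutoff: float = 5.0) -> List[int]:
    """Residue numbers with at least one atom within `cutoff` of a ligand atom."""
    site = set()
    for residue, coord in protein_atoms:
        if residue in site:
            continue  # residue already accepted via another atom
        if any(math.dist(coord, ligand) <= cutoff for ligand in ligand_atoms):
            site.add(residue)
    return sorted(site)


# Toy coordinates in angstroms: residues 10 and 11 are near the ligand, 12 is not.
protein = [(10, (0.0, 0.0, 0.0)), (10, (1.0, 0.0, 0.0)),
           (11, (3.0, 3.0, 0.0)), (12, (10.0, 0.0, 0.0))]
ligand = [(0.0, 0.0, 1.0)]
site = binding_site_residues(protein, ligand)
```

For full-size models a spatial grid or k-d tree would avoid the all-pairs comparison, but the brute-force loop keeps the criterion itself easy to read.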

Prediction of catalytic residues

For target sequences of enzymes, catalytic residues were predicted using the EzCatDB database (42) (XII in Figure 1). EzCatDB provides annotations of catalytic residues together with PDB structure data. The catalytic residues and their positions had already been annotated for sequences in the UniProt database (6), having been mapped from the catalytic residues in the PDB sequence data by BLAST search with an E-value cut-off of 10⁻¹⁰ and POA ver. 2.0 (43). Target sequences were detected among the human proteins in the UniProt database, and catalytic residues were assigned to them in the same manner. Only chemically consistent residues were regarded as catalytic residues. The annotated ‘ACT_SITE’ residues of human proteins in the UniProt database were also mapped onto the target sequences using BLAST search.

Prediction of ID and transmembrane regions

ID regions were predicted by the POODLE-S software (XIII in Figure 1). Transmembrane regions were assigned by the TMHMM software (44) (XIV in Figure 1). Where these predicted regions overlapped with 3D models, the models took priority over the predicted regions.

ACCESS AND INTERFACE

SAHG provides a graphical web interface at http://bird.cbrc.jp/sahg. Clicking a chromosome image lists all proteins encoded on that chromosome together with their predicted models. Choosing the image of a domain shows detailed information on the target protein. More practically, detailed information on specific proteins can be accessed from an ‘Advanced search page’ by querying with Gene ID, RefSeq ID, annotation keywords or their combinations, or by sequence homology search (BLAST). The detailed information page (Figure 2A) shows all contents for a given protein. The ‘Protein information’ panel provides the information associated with the protein's RefSeq ID (I in Figure 2A). The sequence in FASTA format is displayed by clicking the ‘Sequence’ button. Predicted protein complexes, if available, are shown via the ‘Complex’ button (II in Figure 2A). An example of a ‘Complex information’ page is shown in Figure 2B. Links to EC number, EzCatDB (42), HPRD (45), Swiss-Prot (6) and InterPro (46) are provided if available. A bar indicator shows the positions of the predicted models in the full-length protein (III in Figure 2A). It also shows the annotations of ligand-binding residues (retrieved from the holo models), protein–protein interface residues (from protein complexes), catalytic residues (from EzCatDB), ID regions (by POODLE-S) and transmembrane regions (by TMHMM). Pointing at the colored pins on the bar indicator with the mouse shows the precise locations (residue numbers) of ligand-binding residues (green pins), protein–protein interface residues (blue) or catalytic residues (red) (see IV in Figure 2A for an example of a catalytic residue). When a modeled region in the bar indicator (a block on the bar) is selected by clicking, the predicted 3D model appears in the Jmol window (Jmol is an open-source Java viewer for chemical structures in 3D; see http://www.jmol.org/Jmol) (V in Figure 2A).
When models of both apo and holo forms are available (green block on the bar), the structural changes upon ligand-binding are visualized by the morphing technique (the MORPH2 program in Martz-Authored PDB Tools; see http://www.umass.edu/microbio/rasmol/pdbtools.htm) and displayed in this window as an animated image that includes the ligand molecules. Clicking the ligand-binding or catalytic residues on the bar indicator highlights the corresponding residues in the ‘CPK spacefill’ scheme in the Jmol window. The ‘Domain information’ panel shows structural and functional information about the selected model (VI in Figure 2A). The target sequence–template alignments are displayed via the ‘Alignment’ button. The predicted model can be downloaded in PDB format via the ‘model PDB’ button. Ligand-binding residues, protein–protein interface residues and catalytic residues are also listed as ‘Functional Residues’ in the same colors as on the bar indicator. (In Figure 2A, the ‘Domain information’ panel should be scrolled up.)

Figure 2.

(A) Example view of SAHG's detailed information page [RefSeq ID: NP_002834.3, protein tyrosine phosphatase, receptor type, J isoform 1 precursor (48)]. Labels I, II, III, IV, V and VI indicate the ‘Protein information’ panel, the ‘Complex’ button, the ‘bar indicator’, the ‘Catalytic residue’ pin on the bar indicator, the ‘Jmol window’ and the ‘Domain information’ panel, respectively. (B) Example view of a ‘Complex information’ page (NP_002834.3). For this protein, only one complex structure, in a homo-trimeric form, was predicted.

FUTURE DIRECTIONS

To improve the accuracy of structure prediction, we are implementing a probabilistic profile–profile alignment method in our prediction pipeline. The method is an enhanced version of the probabilistic sequence–sequence alignment method (47), which has been shown to perform better than PSI-BLAST, in particular for orphan proteins. New versions of the structure models produced by the new pipeline will appear in the fall of 2010. The prediction results are being examined to clarify the functions and interactions of human proteins. For some proteins, predicted ligands are being verified experimentally. The structure model set in SAHG will be downloadable in bulk in the future.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Japan Science and Technology Agency (JST) – Institute for Bioinformatics Research and Development (BIRD). Funding for open access charge: National Institute of Advanced Industrial Science and Technology (AIST). Conflict of interest statement. None declared.

48 in total

Review 1. Protein folds, functions and evolution.

Authors: J M Thornton; C A Orengo; A E Todd; F M Pearl
Journal: J Mol Biol Date: 1999-10-22 Impact factor: 5.469

2. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.

Authors: A Krogh; B Larsson; G von Heijne; E L Sonnhammer
Journal: J Mol Biol Date: 2001-01-19 Impact factor: 5.469

3. Large Amplitude Elastic Motions in Proteins from a Single-Parameter, Atomic Analysis.

Authors:
Journal: Phys Rev Lett Date: 1996-08-26 Impact factor: 9.161

4. DomCut: prediction of inter-domain linker regions in amino acid sequences.

Authors: Mikita Suyama; Osamu Ohara
Journal: Bioinformatics Date: 2003-03-22 Impact factor: 6.937

5. Protein structure prediction using a variety of profile libraries and 3D verification.

Authors: Kentaro Tomii; Takatsugu Hirokawa; Chie Motono
Journal: Proteins Date: 2005

Review 6. Sequencing technologies - the next generation.

Authors: Michael L Metzker
Journal: Nat Rev Genet Date: 2009-12-08 Impact factor: 53.242

7. Intrinsic protein disorder in complete genomes.

Authors: A K Dunker; Z Obradovic; P Romero; E C Garner; C J Brown
Journal: Genome Inform Ser Workshop Genome Inform Date: 2000

8. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life.

Authors: J J Ward; J S Sodhi; L J McGuffin; B F Buxton; D T Jones
Journal: J Mol Biol Date: 2004-03-26 Impact factor: 5.469

9. EzCatDB: the Enzyme Catalytic-mechanism Database.

Authors: Nozomi Nagano
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. The SWISS-MODEL Repository and associated resources.

Authors: Florian Kiefer; Konstantin Arnold; Michael Künzli; Lorenza Bordoli; Torsten Schwede
Journal: Nucleic Acids Res Date: 2008-10-18 Impact factor: 16.971
