Gene3D: modelling protein structure, function and evolution.

Corin Yeats, Michael Maibaum, Russell Marsden, Mark Dibley, David Lee, Sarah Addou, Christine A Orengo.

Abstract

The Gene3D release 4 database and web portal (http://cathwww.biochem.ucl.ac.uk:8080/Gene3D) provide a combined structural, functional and evolutionary view of the protein world. It is focussed on providing structural annotation for protein sequences without structural representatives, including the complete proteome sets of over 240 different species. The protein sequences have also been clustered into whole-chain families so as to aid functional prediction. The structural annotation is generated using HMM models based on the CATH domain families; CATH is a repository for manually deduced protein domains. Amongst the changes from the last publication are: the addition of over 100 genomes and the UniProt sequence database, domain data from Pfam, metabolic pathway and functional data from COGs, KEGG and GO, and protein-protein interaction data from MINT and BIND. The website has been rebuilt to allow more sophisticated querying, and the data returned are presented in a clearer format with greater functionality. Furthermore, all data can be downloaded in a simple XML format, allowing users to carry out complex investigations on their own computers.

Substances: Proteins; Proteome

Year: 2006 PMID: 16381865 PMCID: PMC1347420 DOI: 10.1093/nar/gkj057

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Detailed knowledge of the functional modules that a protein is composed of often allows a more accurate prediction of its function than simply transferring functional information from the most similar annotated sequences. Conversely, grouping protein sequences into families can aid in accurate information transfer when the domain architecture does not indicate a specific function. In Gene3D we have attempted to combine both of these approaches in a synergistic manner. To further aid interpretation, we have begun including external sources of high quality functional data [i.e. GO (1)]. One principal function of Gene3D is to map CATH (2) domain families to protein sequences. This is a task similar to that carried out by Superfamily (3) for SCOP (4). It requires a different approach from that used to identify domains within structural data, and the steps required to model structural domains and to correctly locate their boundaries within the large sequence databases are not trivial. This process is carried out by Gene3D and we continually look to improve it by exploiting hidden Markov model (HMM) technology; our recent progress is described in (5). In this release we have extended our predictions to the entire UniProt sequence database (6). To improve the reliability of functional data transfer between sequences, we have also clustered UniProt into protein families using Tribe-MCL (7). There are several databases supplying either domain family information (8,9) or whole protein family information (10), but Gene3D is the most comprehensive resource to combine both views of the protein world into a unified system. We also provide specific calculations for 240 genomes (as of September 1st 2005) derived from Integr8 (11). Another major renovation is the use of the BioMap warehouse (12) to supply other sources of structural data, protein–protein interaction data and various functional annotations, including GO and COGs (13).
Finally, we have completely redesigned the website to provide a more intuitive and flexible interface so as to cope with the much richer data we are able to provide.

RECENT DEVELOPMENTS

Gene3D protein families

Information can be more easily transferred between sequences when they belong to the same protein family; i.e. they have a common evolutionary ancestor. We have clustered the ∼1.8 million sequences in Gene3D into families using Tribe-MCL. The process is described in detail in (14). There are 203 982 non-singleton families including 56 221 of more than 5 sequences and 556 224 ‘orphan’ sequences. Within the complete genome data only, which consists of 862 886 proteins, there are 80 291 non-singleton families and 212 567 orphan sequences. These numbers will change as we improve our family definitions and new genomes are added. Our estimates suggest that, while 50% of domains found in genomes belong to families common to all kingdoms (universal), only 10% of proteins belong to universal families (14). This suggests the importance of using protein families in conjunction with domain families to accurately predict functions. The families have also been subclustered by sequence identity at ten different levels from 30% to 100%. This allows greater precision in information transfer and improved understanding of the evolution of the protein family. The families have been refined by ensuring an acceptable similarity in domain composition and sequence length; furthermore, manual examinations are carried out to further improve the accuracy and consistency of these families.

Functional data

In order to provide comprehensive functional annotation we now use the in-house BioMap database (12). BioMap is essentially a warehouse for diverse biological data and contains mappings between several resources and the UniProt sequence database. This provides us with rich descriptions for each sequence and also with a strong internal infrastructure, allowing regular updating of the website. These data are linked through representative sequences using the MD5 sequence digest value. Being able to combine data in this manner can be very powerful when analysing the evolution of protein function, as was demonstrated in (15).
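Linking records through a sequence digest can be illustrated with a minimal sketch. It assumes that sequences are stripped of whitespace and upper-cased before hashing; the text does not state how BioMap normalises sequences, and the function names and toy records below are hypothetical.

```python
import hashlib


def sequence_md5(sequence: str) -> str:
    """Return the hex MD5 digest of a protein sequence.

    Assumption: whitespace is removed and letters are upper-cased before
    hashing, so that formatting differences do not change the digest.
    """
    normalised = "".join(sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


def link_by_digest(*sources):
    """Merge annotation dictionaries keyed by sequence into one keyed by digest."""
    merged = {}
    for source in sources:
        for sequence, annotations in source.items():
            merged.setdefault(sequence_md5(sequence), {}).update(annotations)
    return merged


# Toy records invented for illustration: the same sequence appears in two
# resources with different formatting.
structural = {"MKTAYIAKQR": {"domains": ["domain_A"]}}
functional = {"mktay iakqr\n": {"go_terms": ["term_B"]}}

linked = link_by_digest(structural, functional)
for digest, annotations in linked.items():
    print(digest, annotations)
```

Because the digest depends only on the sequence itself, two resources that use different identifiers for the same protein still end up under one key.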

The website

The Gene3D website has been completely redeveloped to provide more sophisticated querying capabilities and to be able to easily incorporate new functionality and new data types. No JavaScript is used, improving browser compatibility. It is now possible to query by CATH code, Pfam ID or accession, UniProt ID or accession, COG identifier and NCBI taxonomy code. These terms are also tagged within the results pages to facilitate querying of results. We have also included a BLAST (16) search facility which will identify the likely family that the query sequence belongs to. Two main types of data return pages have been developed: the detailed view and the summary view (see below). The detailed views return any CATH, Gene3D protein family, Pfam, GO, KEGG (17), COGs/KOGs, BIND (18) and MINT (19) data associated with the protein or set of proteins in the query. Domain information and other structural information (low complexity regions, coiled coils and signal peptides) are displayed using the Pfam domain drawing service so as to provide a depiction that is familiar to many. We also expect to provide transmembrane helix predictions using the SPLIT 4.0 (20) software shortly. The summary view provides a simple aggregate description of the data set. This view includes all GO terms and all distinct domain architectures found for the proteins under investigation. We have also developed an XML format output so that users can easily download all the data returned in a machine- and human-readable format. This is to aid both automated queries and the retrieval of very large datasets without attempting to display them as HTML. Other notable new features include: an on-the-fly sequence alignment facility [using MUSCLE (21)] to aid users in interpreting and validating structural and functional assignments, mouse-over activated summaries for structural and functional terms and direct links from terms to all the source databases.

Web services (DAS)

In addition to the website we provide comprehensive web-services [including XML-RPC and DAS (22)] for programmatic access to the resource. Web-services are crucial for providing remote users with straightforward tools to integrate our resource into their applications. The web-services API is documented at <>. The Gene3D DAS server offers two services provided by ProServer (). The gene3d_uniprot DAS server () returns a list of Gene3D features for a query UniProt sequence. For each feature the following information is supplied: the Gene3D ID, the feature source method (CATH, Pfam etc.), the feature start/stop coordinates and a note consisting of the method identifier (CATH ID, Pfam accession etc.). The g3dtribe_uniprot DAS server () returns the Gene3D family ID for a query UniProt sequence. This annotation applies to the whole sequence and therefore has a range of 0 to 0 (DAS shorthand for ‘the whole sequence’). Information on using the DAS servers can be found at .
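A client consuming such a response might look like the following sketch. It assumes the servers return the general DAS features layout (SEGMENT and FEATURE elements with METHOD, START, END and NOTE children). The sample document, its identifiers and its coordinates are invented placeholders, not real Gene3D output, and for brevity the mock combines a domain feature and a whole-sequence family feature that the two servers would return separately.

```python
import xml.etree.ElementTree as ET

# Mock DAS features response; every ID, coordinate and note is a placeholder.
SAMPLE_RESPONSE = """<DASGFF>
  <GFF version="1.0" href="placeholder">
    <SEGMENT id="query_protein" start="1" stop="300">
      <FEATURE id="feature_1" label="feature_1">
        <TYPE id="domain">domain</TYPE>
        <METHOD id="CATH">CATH</METHOD>
        <START>10</START>
        <END>120</END>
        <NOTE>cath_id_placeholder</NOTE>
      </FEATURE>
      <FEATURE id="family_1" label="family_1">
        <TYPE id="family">family</TYPE>
        <METHOD id="Gene3D">Gene3D</METHOD>
        <START>0</START>
        <END>0</END>
        <NOTE>family_id_placeholder</NOTE>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>
"""


def parse_features(das_xml):
    """Extract one dictionary per FEATURE element from a DAS features document."""
    root = ET.fromstring(das_xml)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.findall("FEATURE"):
            start = int(feature.findtext("START"))
            end = int(feature.findtext("END"))
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "method": feature.findtext("METHOD"),
                "start": start,
                "end": end,
                # A 0 to 0 range marks an annotation of the whole sequence.
                "whole_sequence": start == 0 and end == 0,
                "note": feature.findtext("NOTE"),
            })
    return features


features = parse_features(SAMPLE_RESPONSE)
for feature in features:
    print(feature)
```

Treating the 0 to 0 range as an explicit flag keeps family-level annotations from being misread as zero-length features.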

USING GENE3D

The nature and arrangement of the data in Gene3D allows researchers to easily ask questions about the general rules of protein evolution and functional distribution and also to investigate individual proteins. Below we describe a few simple investigations.

Genome domain content (multi-domainicity)

With the Gene3D structural data it is simple to approximate the proportion of proteins in any genome that have more than one domain. We initially used the PFscape (14) protocol and ProteinMiner, a locally developed data-mining tool, to determine the domain composition for each protein in the complete genome set. The proteins were then split into two sets: those that had at least one known domain (‘annotated’) and those that did not (‘unannotated’). In the first set, if there was a gap of more than 50 residues then we considered that this indicated the presence of an unidentified domain, allowing us to split the annotated set into single domain proteins and multi-domain proteins. For each genome we then calculated a length threshold based on the average length of the single domain proteins plus two thirds of the standard deviation. The unannotated proteins were then divided into single domain and multi-domain proteins using this threshold. The results of this calculation are shown in Figure 1. The numbers obtained are roughly in concordance with the results obtained by Eckman et al. (23), who treated a gap of 50 residues as equivalent to a domain, and are also within a couple of percent of the value manually calculated by S. Teichmann et al. (24) for Mycoplasma genitalium.
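The procedure can be sketched as follows. This is an illustration under stated assumptions, not the PFscape or ProteinMiner implementation: a gap is taken to be any unassigned stretch (terminal or internal), the population standard deviation is used, and unannotated proteins longer than the threshold are counted as multi-domain. The toy genome is invented.

```python
from statistics import mean, pstdev

GAP_THRESHOLD = 50  # residues; an unassigned stretch longer than this implies a domain


def unassigned_gaps(length, domains):
    """Lengths of unassigned stretches, given 1-based inclusive (start, end) domains."""
    gaps = []
    position = 1
    for start, end in sorted(domains):
        if start > position:
            gaps.append(start - position)
        position = max(position, end + 1)
    if position <= length:
        gaps.append(length - position + 1)
    return gaps


def is_multi_domain(length, domains):
    """Annotated protein: multi-domain if it has 2+ domains or one long gap."""
    long_gap = any(gap > GAP_THRESHOLD for gap in unassigned_gaps(length, domains))
    return len(domains) > 1 or long_gap


def length_threshold(single_domain_lengths):
    """Mean single-domain length plus two thirds of the standard deviation.

    Assumption: population standard deviation; needs at least one length.
    """
    return mean(single_domain_lengths) + (2.0 / 3.0) * pstdev(single_domain_lengths)


def multi_domain_percentage(proteins):
    """proteins: list of (length, domains); domains is empty if unannotated."""
    annotated = [(length, doms) for length, doms in proteins if doms]
    unannotated = [length for length, doms in proteins if not doms]
    multi = sum(1 for length, doms in annotated if is_multi_domain(length, doms))
    single_lengths = [l for l, d in annotated if not is_multi_domain(l, d)]
    threshold = length_threshold(single_lengths)
    # Assumption: unannotated proteins longer than the threshold are multi-domain.
    multi += sum(1 for length in unannotated if length > threshold)
    return 100.0 * multi / len(proteins)


# Toy genome, invented for illustration.
proteins = [
    (100, [(1, 100)]),              # single domain
    (120, [(5, 115)]),              # single domain, short terminal gaps
    (140, [(1, 140)]),              # single domain
    (300, [(1, 100), (150, 290)]),  # two assigned domains
    (250, [(1, 120)]),              # one domain plus a 130-residue gap
    (110, []),                      # unannotated, below threshold
    (400, []),                      # unannotated, above threshold
]
print(round(multi_domain_percentage(proteins), 1))
```

Computing the threshold per genome, as the text describes, means the cut-off adapts to each organism's typical domain length.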

Figure 1

Distribution of multi-domain proteins in 240 genomes. For each genome the approximate percentage of multi-domain proteins was calculated. The likely domain content for those proteins that have no known domains was approximated on the basis of length (for details see text). The length threshold was calculated for each genome individually. Of note, the multi-domain percentage for Eukaryotes was within the range displayed by Eubacteria, but the mean for Eukaryotes is substantially higher than for Prokaryotes.

Annotating hypothetical proteins

Gene3D can also be used to effectively predict functions for ‘hypothetical proteins’. As an example we took the first three non-viral proteins with the name ‘hypothetical protein’ returned by UniProt: O43716 (human), Q9SMZ9 (Arabidopsis thaliana) and O30176 (Archaeoglobus fulgidus). By examining the functional terms and structural predictions associated with these proteins and the protein families they belong to, we were able to assign some annotation to all of them. Use of the sequence alignment tool allowed us to justify this transfer of information. O43716 is a Glu-tRNAGln amidotransferase C subunit (Pfam:PF02686). Q9SMZ9 is possibly involved in mitochondrial distribution and morphology (GO process:0007005). O30176 contains a MazG nucleotide pyrophosphohydrolase domain (Pfam:PF03819).

Identifying structural targets

Another current task for Gene3D is identifying good targets for the second phase of the NIH-funded protein structure initiative (PSI2). Using our data we are able to identify those domain sequences, as determined using the Pfam hits, that do not have a close (>30% sequence identity) homologue with a solved structure, as determined using the CATH classification.
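The selection criterion can be expressed as a simple filter. The sketch below assumes that, for each Pfam-defined domain sequence, the highest percentage identity to any structurally characterised domain has already been computed; the function name, the input mapping and the toy values are hypothetical.

```python
IDENTITY_CUTOFF = 30.0  # percent; a homologue above this counts as 'close'


def select_targets(best_identity_to_structure):
    """Return IDs of domain sequences lacking a close homologue of known structure.

    best_identity_to_structure maps a domain-sequence ID to its highest
    percentage identity to any solved structure, or None if there is no hit.
    """
    return sorted(
        domain_id
        for domain_id, identity in best_identity_to_structure.items()
        if identity is None or identity <= IDENTITY_CUTOFF
    )


# Toy values, invented for illustration.
best_hits = {
    "domain_a": 82.5,  # close homologue solved: not a target
    "domain_b": 24.0,  # only a remote homologue: target
    "domain_c": None,  # no structural homologue at all: target
    "domain_d": 30.0,  # exactly at the cut-off, so not 'close': target
}
targets = select_targets(best_hits)
print(targets)
```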

DISCUSSION AND DEVELOPMENT

Gene3D has been redesigned to provide a feature-rich workbench for both the laboratory and the computational biologist. It is possible to investigate individual proteins in detail, with most major sources of functional information presented, and it is also easy to download large datasets for global functional or evolutionary analyses. The new internal infrastructure allows novel data sources to be easily included and so we anticipate significant expansion in the data we present. We also wish to expand the website tools, particularly with regard to genome comparison, to aid researchers in understanding the structural evolution of proteomes.

REFERENCES (10 of 23 listed)

1. CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins.

Authors: E V Kriventseva; W Fleischmann; E M Zdobnov; R Apweiler
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

2. Basic charge clusters and predictions of membrane protein topology.

Authors: Davor Juretić; Larisa Zoranić; Damir Zucić
Journal: J Chem Inf Comput Sci Date: 2002 May-Jun

3. The Gene Ontology (GO) database and informatics resource.

Authors: M A Harris; J Clark; A Ireland; J Lomax; M Ashburner; R Foulger; K Eilbeck; S Lewis; B Marshall; C Mungall; J Richter; G M Rubin; J A Blake; C Bult; M Dolan; H Drabkin; J T Eppig; D P Hill; L Ni; M Ringwald; R Balakrishnan; J M Cherry; K R Christie; M C Costanzo; S S Dwight; S Engel; D G Fisk; J E Hirschman; E L Hong; R S Nash; A Sethuraman; C L Theesfeld; D Botstein; K Dolinski; B Feierbach; T Berardini; S Mundodi; S Y Rhee; R Apweiler; D Barrell; E Camon; E Dimmer; V Lee; R Chisholm; P Gaudet; W Kibbe; R Kishore; E M Schwarz; P Sternberg; M Gwinn; L Hannick; J Wortman; M Berriman; V Wood; N de la Cruz; P Tonellato; P Jaiswal; T Seigfried; R White
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

4. Evolution of protein superfamilies and bacterial genome size.

Authors: Juan A G Ranea; Daniel W A Buchan; Janet M Thornton; Christine A Orengo
Journal: J Mol Biol Date: 2004-02-27 Impact factor: 5.469

5. Assessing strategies for improved superfamily recognition.

Authors: Ian Sillitoe; Mark Dibley; James Bray; Sarah Addou; Christine Orengo
Journal: Protein Sci Date: 2005-06-03 Impact factor: 6.725

6. MINT: a Molecular INTeraction database.

Authors: Andreas Zanzoni; Luisa Montecchi-Palazzi; Michele Quondam; Gabriele Ausiello; Manuela Helmer-Citterich; Gianni Cesareni
Journal: FEBS Lett Date: 2002-02-20 Impact factor: 4.124

7. The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis.

Authors: Frances Pearl; Annabel Todd; Ian Sillitoe; Mark Dibley; Oliver Redfern; Tony Lewis; Christopher Bennett; Russell Marsden; Alistair Grant; David Lee; Adrian Akpor; Michael Maibaum; Andrew Harrison; Timothy Dallman; Gabrielle Reeves; Ilhem Diboun; Sarah Addou; Stefano Lise; Caroline Johnston; Antonio Sillero; Janet Thornton; Christine Orengo
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

8. The Universal Protein Resource (UniProt).

Authors: Amos Bairoch; Rolf Apweiler; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

9. The COG database: an updated version includes eukaryotes.

Authors: Roman L Tatusov; Natalie D Fedorova; John D Jackson; Aviva R Jacobs; Boris Kiryutin; Eugene V Koonin; Dmitri M Krylov; Raja Mazumder; Sergei L Mekhedov; Anastasia N Nikolskaya; B Sridhar Rao; Sergei Smirnov; Alexander V Sverdlov; Sona Vasudevan; Yuri I Wolf; Jodie J Yin; Darren A Natale
Journal: BMC Bioinformatics Date: 2003-09-11 Impact factor: 3.169

10. MUSCLE: a multiple sequence alignment method with reduced time and space complexity.

Authors: Robert C Edgar
Journal: BMC Bioinformatics Date: 2004-08-19 Impact factor: 3.169
