
The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis.

Frances Pearl, Annabel Todd, Ian Sillitoe, Mark Dibley, Oliver Redfern, Tony Lewis, Christopher Bennett, Russell Marsden, Alistair Grant, David Lee, Adrian Akpor, Michael Maibaum, Andrew Harrison, Timothy Dallman, Gabrielle Reeves, Ilhem Diboun, Sarah Addou, Stefano Lise, Caroline Johnston, Antonio Sillero, Janet Thornton, Christine Orengo.

Abstract

The CATH database of protein domain structures (http://www.biochem.ucl.ac.uk/bsm/cath/) currently contains 43,229 domains classified into 1467 superfamilies and 5107 sequence families. Each structural family is expanded with sequence relatives from GenBank and completed genomes, using a variety of efficient sequence search protocols and reliable thresholds. This extended CATH protein family database contains 616,470 domain sequences classified into 23,876 sequence families. This results in a significant expansion of the CATH HMM model library to include models built from the CATH sequence relatives, giving a 10% increase in coverage for detecting remote homologues. An improved Dictionary of Homologous Superfamilies (DHS) (http://www.biochem.ucl.ac.uk/bsm/dhs/), containing specific sequence, structural and functional information for each superfamily in CATH, considerably assists manual validation of homologues. Information on sequence relatives in CATH superfamilies, GenBank and completed genomes is presented in the CATH-associated DHS and Gene3D resources. Domain partnership information can be obtained from Gene3D (http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/). A new CATH server has been implemented (http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl), providing automatic classification of newly determined sequences and structures using a suite of rapid sequence and structure comparison methods. The statistical significance of matches is assessed and links are provided to the putative superfamily or fold group to which the query sequence or structure is assigned.


Year: 2005 PMID: 15608188 PMCID: PMC539978 DOI: 10.1093/nar/gki024

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

DESCRIPTION OF THE CATH HIERARCHY AND CURRENT POPULATION STATISTICS

The CATH database is a hierarchical classification of domains into sequence- and structure-based families and fold groups. Table 1 shows the population of the latest release of CATH (Version 2.5.1, released January 2004). At the lowest level of the hierarchy, sequences are clustered according to significant sequence similarity (35% identity and above, the S-level). At higher levels, domains are grouped according to whether they share significant sequence, structural and/or functional similarity (homologous superfamilies, the H-level) or just structural similarity (fold or topology group, the T-level). Fold groups sharing similar architectures, i.e. similarities in the arrangements of their secondary structures regardless of connectivity, are then merged into common architectures (the A-level). At the top of the hierarchy, domains are clustered depending on their class, i.e. the percentage of α-helices or β-strands (the C-level).

Table 1.

Populations of the different levels in the CATH hierarchy

Level	Class 1	Class 2	Class 3	Class 4	Total	(Class 5)
A	5	19	12	1	37	(n/a)
T	227	139	361	86	813	(n/a)
H	433	286	659	89	1467	(n/a)
S	957	961	2008	110	4036	(1071)
All	9013	12962	20411	843	43229	(12475)
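The hierarchy and the Table 1 populations can be made concrete with a short sketch. It assumes that a position in the hierarchy is written as dot-separated numbers, one per level from C down to S (for example `3.40.50` for a fold group); that notation, and the names `CathNode` and `parse_node`, are illustrative assumptions rather than anything defined in the text above. The second half simply re-adds the class 1 to 4 columns of Table 1 to confirm the printed totals.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Levels from the top of the hierarchy to the bottom, as described in the text:
# Class, Architecture, Topology (fold), Homologous superfamily, Sequence family.
LEVELS = ("C", "A", "T", "H", "S")


@dataclass(frozen=True)
class CathNode:
    """A position in the hierarchy (hypothetical representation)."""
    numbers: Tuple[int, ...]

    @property
    def level(self) -> str:
        # One number means a class, two an architecture, and so on.
        return LEVELS[len(self.numbers) - 1]

    def parent(self) -> Optional["CathNode"]:
        # A class has no parent; every other node drops its last number.
        if len(self.numbers) == 1:
            return None
        return CathNode(self.numbers[:-1])


def parse_node(code: str) -> CathNode:
    """Parse an assumed dot-separated code such as '3.40.50'."""
    numbers = tuple(int(part) for part in code.split("."))
    if not 1 <= len(numbers) <= len(LEVELS):
        raise ValueError(f"expected 1 to {len(LEVELS)} numbers, got {code!r}")
    return CathNode(numbers)


# Table 1, classes 1-4 only (the class 5 column is kept separate in the table).
TABLE_1 = {
    "A": (5, 19, 12, 1),
    "T": (227, 139, 361, 86),
    "H": (433, 286, 659, 89),
    "S": (957, 961, 2008, 110),
    "All": (9013, 12962, 20411, 843),
}
totals = {level: sum(counts) for level, counts in TABLE_1.items()}

fold = parse_node("3.40.50")
print(fold.level, fold.parent().level)  # T A
print(totals["H"], totals["All"])       # 1467 43229
```

Adding the 1071 class 5 sequence families to the S-level total of 4036 gives the 5107 sequence families quoted in the abstract.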

IMPROVED CLASSIFICATION PROTOCOLS

Below we describe some new CATH-associated resources and protocols that increase the speed and reliability of classifying newly determined protein structures in the CATH database.

Validation of homologues using the CATH Dictionary of Homologous Superfamilies (DHS)

The CATH-associated Dictionary of Homologous Superfamilies (DHS) (http://www.biochem.ucl.ac.uk/bsm/dhs/) was established in 1997 (1) and contains a variety of sequence, structural and functional information for each superfamily in CATH. It was updated recently for CATH version 2.5.1, which contains 1467 homologous superfamilies, 334 of which are populated with three or more remote homologues (<35% sequence identity). The DHS contains information on the pairwise sequence and structural similarities for all pairs of relatives in each superfamily. Sequence similarity is recorded by sequence identity and E-value. Structural similarity is recorded by pairwise SSAP score (2) and also by E-values determined against a distribution of scores obtained by comparing all non-redundant structures with each other. Multiple structure alignments are derived for structurally coherent subgroups of relatives, in which every relative has a pairwise SSAP score of >85 against all other relatives in the subgroup. These are generated using the CORA algorithm (3) and displayed using CORAplot (3). The current DHS contains 671 structural alignments from 416 superfamilies. Highly conserved sequence positions, which may be associated with functionally important sites, are highlighted. Two new methods have been devised to illustrate the degree of structural divergence across the superfamily. Both exploit a multiple structure alignment to identify equivalent secondary structures across the superfamily and inserted secondary structures. Plots give information on highly conserved secondary structures that are diagnostic for the particular superfamily and on the degree of structural embellishment occurring in diverse relatives. Putative homologues to a particular CATH superfamily can be aligned against structural relatives in order to determine whether their structural characteristics fall within the range of structural diversity observed across the superfamily.
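The subgroup criterion above (every member scores >85 by SSAP against every other member) can be illustrated with a minimal sketch. It uses a simple greedy pass, which is an assumption made here for illustration and is order-dependent; the text does not say how the DHS actually forms its subgroups. The domain names and scores are invented.

```python
from typing import Dict, FrozenSet, List


def coherent_subgroups(
    domains: List[str],
    ssap: Dict[FrozenSet[str], float],
    threshold: float = 85.0,
) -> List[List[str]]:
    """Greedily group domains so that all pairs within a group score > threshold.

    `ssap` maps an unordered pair of domain names to a pairwise SSAP score.
    Each domain joins the first existing group in which it exceeds the
    threshold against every member; otherwise it starts a new group.
    """
    groups: List[List[str]] = []
    for domain in domains:
        for group in groups:
            if all(ssap[frozenset((domain, member))] > threshold for member in group):
                group.append(domain)
                break
        else:
            groups.append([domain])
    return groups


# Invented scores for four hypothetical relatives of one superfamily.
example_scores = {
    frozenset(("dom_a", "dom_b")): 90.0,
    frozenset(("dom_a", "dom_c")): 88.0,
    frozenset(("dom_b", "dom_c")): 80.0,
    frozenset(("dom_a", "dom_d")): 70.0,
    frozenset(("dom_b", "dom_d")): 60.0,
    frozenset(("dom_c", "dom_d")): 86.0,
}
subgroups = coherent_subgroups(["dom_a", "dom_b", "dom_c", "dom_d"], example_scores)
print(subgroups)  # [['dom_a', 'dom_b'], ['dom_c', 'dom_d']]
```

Here `dom_c` scores above 85 against `dom_a` but not against `dom_b`, so it cannot join their subgroup; the all-against-all requirement is what keeps each subgroup structurally coherent enough for a single multiple alignment.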
Information on the population of the superfamily is also provided so that users can gauge how well the superfamily has been sampled to date. Functional annotations are also provided for each superfamily in the DHS by recruiting relevant functional data from the Protein Data Bank (PDB) (4), GenBank (5), ENZYME (6), KEGG (7) and Gene Ontology (8) databases. The more than 10-fold expansion in the extended CATH database (from 43 229 CATH structural domain sequences to 616 470 by including related GenBank sequences and genome sequences) has significantly increased the amount of functional data available for a particular superfamily. Expansion in the functional information, together with more informative descriptions of structural variability in each CATH superfamily, considerably assists in validating new homologues classified in CATH. Furthermore, links to the DHS are provided for structural matches identified using the CATH server.

Improved detection of remote homologues using an extended CATH-HMM model library

Profile-based methods for sequence comparison were developed in the early 1980s and allowed recognition of more distant homologues than pairwise-based approaches (9). Benchmarking of several publicly available methods, including those using position-specific scoring matrices and hidden Markov models (HMMs), has been undertaken by several groups (10,11). These approaches used datasets of distant homologues selected from the structural classifications, such as SCOP and CATH, to determine the sensitivity of various profile-based methods, e.g. HMMs (12) and PSI-BLAST (13). We recently used a dataset of remote structural homologues from the CATH database (<35% sequence identity), which had been validated by structure comparison and manual inspection, to assess the performance of several HMM-based strategies (Strategies for Improved Fold and Superfamily Recognition in Genome Annotation; I. Sillitoe, personal communication). HMMs were built using the SAM-T technology developed by Karplus et al. (14). A total of 23 876 HMM models were built for representative sequences from each sequence family in the extended CATH database (containing 616 470 domain sequences). The extended model library gives a 10% increase in coverage for remote homologue detection compared to the standard CATH HMM model library, with a low error rate (0.1%) (I. Sillitoe, personal communication). It can be seen from Figure 1 that, on average, nearly 87% of homologues classified in CATH over the last two years could be recognized using sequence comparison methods, either pairwise sequence alignment or scans against the more sensitive extended CATH-HMM model library.

Figure 1

The proportion (%) of structures from the PDB that have been classified in CATH over the last two years using different sequence comparison or structure comparison methods. Blue segment: PDB sequences with 95% sequence identity or more to existing CATH domains, recognized using SSEARCH. Magenta segment: PDB sequences with 30% sequence identity or more to existing CATH domains, recognized using SSEARCH. Yellow segment: PDB entries that can be assigned to existing CATH superfamilies by scanning the HMM library. Green segment: PDB entries that can be assigned to CATH superfamilies by structure comparisons against CATH representatives using SSAP. Purple segment: PDB entries that can be assigned to CATH fold groups by structure comparisons against CATH representatives using SSAP. Orange segment: PDB entries that do not match any CATH structure and represent novel folds.
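The segments in Figure 1 can be read as a cascade in which each PDB entry is credited to the first method that recognizes it. The sketch below illustrates that reading; the ordering of the checks, the `Evidence` record and its boolean fields are assumptions made for illustration, and only the 95% and 30% identity cut-offs come from the caption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Evidence:
    """Hypothetical summary of what each method found for one PDB entry."""
    seq_identity: float     # best pairwise identity (%) to an existing CATH domain
    hmm_match: bool         # significant hit in the HMM library
    ssap_superfamily: bool  # SSAP match at superfamily level
    ssap_fold: bool         # SSAP match at fold-group level only


def assign_segment(e: Evidence) -> str:
    """Return the Figure 1 segment for an entry, checking methods in order."""
    if e.seq_identity >= 95:
        return "sequence >=95% identity"
    if e.seq_identity >= 30:
        return "sequence >=30% identity"
    if e.hmm_match:
        return "HMM library"
    if e.ssap_superfamily:
        return "SSAP superfamily"
    if e.ssap_fold:
        return "SSAP fold group"
    return "novel fold"


def segment_percentages(entries: Iterable[Evidence]) -> Dict[str, float]:
    """Percentage of entries falling into each segment."""
    counts = Counter(assign_segment(e) for e in entries)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Four invented entries, one per outcome shown.
toy_entries = [
    Evidence(98.0, True, True, True),
    Evidence(42.0, True, True, True),
    Evidence(18.0, True, True, True),
    Evidence(12.0, False, False, False),
]
print(segment_percentages(toy_entries))
```

With real counts, the first three segments added together would correspond to the share of entries recognized by sequence comparison alone.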

Expansion of CATH with sequence relatives from completed genomes and domain partnership information

We have recently devised protocols for identifying sequence relatives to CATH superfamilies in completed genomes (15). To date, nearly one million sequences from 150 completed genomes have been scanned against the CATH-HMM model library (15). Between 40 and 60% of sequences or partial sequences from each genome could be assigned to a CATH superfamily. Genome sequences were also scanned against libraries of HMM models from the Pfam database (release 10) (16) in order to extend the domain annotation of each genome sequence and provide more comprehensive information on domain partnerships. Sequence relatives to CATH superfamilies identified in this way are displayed in the CATH-related DHS and Gene3D resources. Gene3D displays the domain composition of each gene annotated by CATH and Pfam domains. CATH family data in the Gene3D resource have revealed some intriguing insights into the expansion of superfamilies involved in metabolism and regulation in bacterial genomes (17). Figure 2 shows that the power-law-like trends first detected in the structural classifications are mirrored when sequence relatives from the genomes are also included. Considering the structural data alone, it can be seen from Figure 2a that fewer than 10 of the most highly populated folds in the CATH database account for nearly 25% of all superfamilies in the PDB. These folds were previously described as superfolds, as they are adopted by many diverse homologous superfamilies (18). When genome sequences are included, it can be seen from Figure 2b that the same fold groups dominate the genomes, as they are adopted by nearly 45% of all close sequence families (relatives have 35% or more sequence identity) of known structure in the genomes.
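The statement that a handful of folds account for a large share of superfamilies (or of close sequence families) is a top-k share calculation. A minimal sketch follows; the function name and the toy counts are invented for illustration and are not CATH data.

```python
from typing import Dict


def top_fold_share(families_per_fold: Dict[str, int], k: int) -> float:
    """Fraction of all families that belong to the k most populated folds."""
    counts = sorted(families_per_fold.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        raise ValueError("no families to count")
    return sum(counts[:k]) / total


# Invented, deliberately skewed distribution of families across five folds.
toy_counts = {"fold_a": 50, "fold_b": 30, "fold_c": 10, "fold_d": 5, "fold_e": 5}
print(top_fold_share(toy_counts, 2))  # 0.8
```

Applied to the number of homologous superfamilies per fold group, a call with `k` just under 10 would be the quantity behind the "nearly 25%" figure; applied to close sequence families per fold group in the genomes, it would be the quantity behind the "nearly 45%" figure.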

Figure 2

CATHerine wheels. (a) Illustrating the distribution of domain structures from the PDB among the different levels in the CATH hierarchy. The three classes are illustrated in colour: mainly α pink, mainly β yellow and α-β green. The inner wheel corresponds to different architectures in the classification and the outer wheel to different fold groups. Each fold group has been subdivided according to the numbers and populations of different homologous superfamilies adopting that fold. (b) Illustrating the distribution of CATH domains among the sequences from 150 completed genomes, in Gene3D. In this case, the fold groups labelled in the outer circle have been divided according to the number and size of close sequence families within each fold group.

THE CATH SERVER

A new protocol has been developed for searching CATH with a newly determined protein structure. Structures submitted to the server (http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl) are first processed by the DDMake suite of programs, which generate derived data from the PDB coordinate files (e.g. secondary structure data, residue accessibilities and φψ data, sequence data in the FASTA format, etc.). The query sequence is then scanned against the CATH-HMM model library to identify more remote homologues. Threshold E-values used to recognize homologues are predetermined by benchmarking with validated structural homologues from CATH (I. Sillitoe, personal communication). If the sequence returns a significant match to any relative in one or more CATH superfamilies, representatives from all close sequence families within those superfamilies are structurally compared with the query structure using the SSAP structure alignment program (2). The top 10 structural matches, sorted by SSAP score, are then displayed together with information on the degree of sequence and structural similarity and with links to the CATH page and the DHS page for each CATH superfamily identified. Rasmol images are also provided for the top 10 matches. Any query structure unmatched by the CATH-HMM library is scanned against a library of representative structures from each close sequence family in CATH using the rapid structure comparison algorithm, CATHEDRAL (19). CATHEDRAL uses a robust statistical framework based on the extreme value distributions observed for random similarities to assess significance. If the query structure significantly matches one or more CATH superfamilies, SSAP comparisons are performed for all sequence representatives in those superfamilies and the top 10 matches are displayed, as before.
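The decision flow of the server can be summarized in a short sketch. All four callables (`hmm_scan`, `cathedral_scan`, `representatives`, `ssap`) are hypothetical stand-ins for the real programs, which are not exposed as Python functions in the text; the sketch only shows the order of the steps and the top-10 ranking, not how any of the methods work internally.

```python
from typing import Callable, List, Tuple

Hit = Tuple[float, str, str]  # (SSAP score, superfamily, representative)


def classify_structure(
    query: str,
    hmm_scan: Callable[[str], List[str]],
    cathedral_scan: Callable[[str], List[str]],
    representatives: Callable[[str], List[str]],
    ssap: Callable[[str, str], float],
    top_n: int = 10,
) -> Tuple[str, List[Hit]]:
    """Return the route taken and the best SSAP matches for a query.

    hmm_scan / cathedral_scan return the superfamilies matched significantly;
    representatives lists the close-sequence-family representatives of a
    superfamily; ssap scores the query against one representative.
    """
    # Step 1: sequence scan against the HMM library.
    superfamilies = hmm_scan(query)
    route = "HMM"
    # Step 2: fall back to the rapid structure scan if nothing matched.
    if not superfamilies:
        superfamilies = cathedral_scan(query)
        route = "CATHEDRAL"
    if not superfamilies:
        return "unmatched", []
    # Step 3: SSAP against every representative of the matched superfamilies.
    hits = [
        (ssap(query, rep), superfamily, rep)
        for superfamily in superfamilies
        for rep in representatives(superfamily)
    ]
    # Step 4: report the highest-scoring matches first.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return route, hits[:top_n]


# Invented stand-ins to exercise the flow.
toy_reps = {"sf_1": ["rep_1", "rep_2"], "sf_2": ["rep_3"]}
toy_scores = {"rep_1": 70.0, "rep_2": 91.5, "rep_3": 82.0}
route, hits = classify_structure(
    "query",
    hmm_scan=lambda q: ["sf_1"],
    cathedral_scan=lambda q: ["sf_2"],
    representatives=lambda sf: toy_reps[sf],
    ssap=lambda q, rep: toy_scores[rep],
)
print(route, hits)  # HMM [(91.5, 'sf_1', 'rep_2'), (70.0, 'sf_1', 'rep_1')]
```

Passing the scanners in as arguments keeps the sketch independent of any particular tool and makes the fallback from the sequence scan to the structure scan easy to test with stubs.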

19 in total

1. KEGG: Kyoto Encyclopedia of Genes and Genomes.

Authors: M Kanehisa; S Goto
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. The ENZYME database in 2000.

Authors: A Bairoch
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

3. The CATH Dictionary of Homologous Superfamilies (DHS): a consensus approach for identifying distant structural homologues.

Authors: J E Bray; A E Todd; F M Pearl; J M Thornton; C A Orengo
Journal: Protein Eng Date: 2000-03

4. Evolution of protein superfamilies and bacterial genome size.

Authors: Juan A G Ranea; Daniel W A Buchan; Janet M Thornton; Christine A Orengo
Journal: J Mol Biol Date: 2004-02-27 Impact factor: 5.469

5. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods.

Authors: J Park; K Karplus; C Barrett; R Hughey; D Haussler; T Hubbard; C Chothia
Journal: J Mol Biol Date: 1998-12-11 Impact factor: 5.469

6. Hidden Markov models for detecting remote protein homologies.

Authors: K Karplus; C Barrett; R Hughey
Journal: Bioinformatics Date: 1998 Impact factor: 6.937

7. Hidden Markov models.

Authors: S R Eddy
Journal: Curr Opin Struct Biol Date: 1996-06 Impact factor: 6.809

8. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

9. Protein structure alignment.

Authors: W R Taylor; C A Orengo
Journal: J Mol Biol Date: 1989-07-05 Impact factor: 5.469

10. Protein superfamilies and domain superfolds.

Authors: C A Orengo; D T Jones; J M Thornton
Journal: Nature Date: 1994-12-15 Impact factor: 49.962
