
The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies.

Alison L Cuff, Ian Sillitoe, Tony Lewis, Oliver C Redfern, Richard Garratt, Janet Thornton, Christine A Orengo.

Abstract

The latest version of CATH (class, architecture, topology, homology), version 3.2, released in July 2008 (http://www.cathdb.info), contains 114,215 domains, 2178 homologous superfamilies and 1110 fold groups. We have assigned 20,330 new domains, 87 new homologous superfamilies and 26 new folds since CATH release version 3.1. A total of 28,064 new domains have been assigned since our NAR 2007 database publication (CATH version 3.0). The CATH website has been completely redesigned and includes more comprehensive documentation. We have revisited the CATH architecture level as part of the development of a 'Protein Chart' and present information on the population of each architecture. The CATHEDRAL structure comparison algorithm has been improved and used to characterize structural diversity in CATH superfamilies and structural overlaps between superfamilies. Although the majority of superfamilies in CATH are not structurally diverse and do not overlap significantly with other superfamilies, approximately 4% of superfamilies are very diverse, and these are the superfamilies that are most highly populated in both the PDB and the genomes. Information on the degree of structural diversity in each superfamily and on structural overlaps between superfamilies can now be downloaded from the CATH website.

Year: 2008 PMID: 18996897 PMCID: PMC2686597 DOI: 10.1093/nar/gkn877

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

CURRENT POPULATION OF THE CATH HIERARCHY

CATH (class, architecture, topology, homology) is a hierarchical protein domain classification (1) in which domains are classified manually by curators, guided by prediction algorithms (such as structure comparison). Each protein structure is decomposed into one or more chains, which in turn are split into one or more domains before being classified into homologous superfamilies according to both structure and function. At the Class, or C-level, the domains are classified simply on the basis of their secondary structure content [whether they are mostly α-helical (Class 1) or β-sheet (Class 2), contain a significant percentage of both secondary structure elements (Class 3) or contain very little secondary structure (Class 4)]. The domains within each class are then sorted according to their architecture, that is, similarities in the arrangement of secondary structures in 3D space. Each architecture (A-level) is further broken down into one or more topology, or fold, groups (T-level), where the connectivity between these secondary structures is taken into account. The domains are then classified into their respective homologous superfamilies (H-level) according to similarities in sequence, structure and/or function. Clustering of the domains within each homologous superfamily at 35% sequence identity and above then produces one or more sequence families (S-level) for each superfamily. Table 1 shows the current population of the different levels in the CATH hierarchy.
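The dotted node numbers used in this paper (for example 1.40 for the α-solenoid architecture) follow the C, A, T and H levels described above. As a minimal sketch, assuming identifiers are plain dot-separated integers, the hierarchy can be read off an identifier as follows. The function names and the four-part identifier in the usage note are illustrative assumptions, not part of any CATH software.

```python
# Level names in the order they appear in a dotted CATH identifier.
LEVELS = ("class", "architecture", "topology", "homologous_superfamily")

# The four classes as described in the text above.
CLASS_NAMES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "alpha-beta",
    4: "little secondary structure",
}


def parse_cath_id(node_id: str) -> dict:
    """Split a dotted identifier such as '1.40' into named hierarchy levels."""
    parts = node_id.split(".")
    if not 1 <= len(parts) <= len(LEVELS):
        raise ValueError(f"expected 1 to 4 dot-separated fields, got {node_id!r}")
    numbers = [int(p) for p in parts]
    if numbers[0] not in CLASS_NAMES:
        raise ValueError(f"unknown class {numbers[0]} in {node_id!r}")
    return dict(zip(LEVELS, numbers))


def ancestors(node_id: str) -> list:
    """Return the identifiers of all parent nodes, from class downwards."""
    parts = node_id.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


if __name__ == "__main__":
    print(parse_cath_id("1.40"))     # architecture-level node
    print(ancestors("3.40.50.720"))  # illustrative superfamily-level identifier
```

Here `parse_cath_id("2.115")` would describe the 5-bladed propeller architecture mentioned later in the text, and `ancestors` lists the class, architecture and topology nodes that contain a given superfamily.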

Table 1.

Release statistics for CATH version 3.2

Class	Architecture	Topology	Homologous superfamily	S35 family
1	5	310	682	2078
2	20	196	438	2062
3	14	512	956	4558
4	1	92	102	173
Total	40	1110	2178	8871
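As a quick consistency check, the per-class rows of Table 1 can be summed and compared with the reported totals. This is only a sketch over the numbers transcribed from the table; the variable and function names are assumptions made for illustration.

```python
# Rows of Table 1: class -> (architectures, topologies, superfamilies, S35 families)
TABLE_1 = {
    1: (5, 310, 682, 2078),
    2: (20, 196, 438, 2062),
    3: (14, 512, 956, 4558),
    4: (1, 92, 102, 173),
}

# The "Total" row as printed in Table 1.
REPORTED_TOTALS = (40, 1110, 2178, 8871)


def column_totals(table: dict) -> tuple:
    """Sum each column of the table across the four classes."""
    return tuple(sum(column) for column in zip(*table.values()))


def superfamily_share(table: dict) -> dict:
    """Percentage of homologous superfamilies that falls in each class."""
    total = sum(row[2] for row in table.values())
    return {cls: round(100 * row[2] / total, 1) for cls, row in table.items()}


if __name__ == "__main__":
    print(column_totals(TABLE_1) == REPORTED_TOTALS)  # True
    print(superfamily_share(TABLE_1))
```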


A PERIODIC TABLE OF CATH ARCHITECTURES

A visual snapshot of the domain architectures in the CATH database is now captured in a new ‘Protein Chart’ (2). This chart, inspired by Taylor's ‘Periodic Table’ of protein structures devised in 2002 (3), shows fold representatives of all the most regular domain architectures currently classified in the CATH database. It is organized so that the smallest representative of any given architecture is at the top of the chart and the largest at the bottom, giving a guide to the variation in size and structure that can occur. Functional information and population statistics for each architecture are provided in a table accessible from the CATH website (http://www.cathdb.info/download#version_v3.1). Using the chart, we have identified nine architectures that have been added since CATH architectures were first presented in 1997 (1) (Figure 1). These new architectures are not highly populated, accounting for only ∼4% of predicted CATH domain sequences in the genomes.

Figure 1.

Some of the architectures new to CATH since 1997.

In the mainly-α class, the α-solenoid architecture (1.40) contains only one superfamily. Its domains provide an α-helical scaffold for a central hydrophobic cavity, which contains light-harvesting molecules (4). The αα-barrel (1.50) contains two α-helical layers, with long loops that create a tunnel; these domains are typically glycosyl hydrolases (5). The α-horseshoe (1.25) is a super-helical structure made up of a number of three-helix orthogonal bundle repeats. In the mainly-β class, we identify a new β-propeller, the 5-bladed propeller (2.115). A new sandwich architecture, the 3-layer βββ-sandwich, is made up of three anti-parallel β-sheets layered into three adjacent stacks with an immunoglobulin-like sub-domain. Most are Rieske iron–sulphur proteins (7). Four new architectures are classified in the α-β class. Super-rolls are made from twisted anti-parallel β-strands capped by two α-helices. All classified domains bind to and neutralize lipopolysaccharides in the outer membrane of Gram-negative bacteria (8). The 3-layer (βαβ) sandwich architecture contains 10 domains in three different folds. The most highly populated fold is composed largely of bacterial heat shock proteins. The αβ-prism is made up of a repeating folding unit composed of two parallel α-helices and a β-sheet. Domains with this fold are commonly found in 5-enolpyruvylshikimate-3-phosphate synthase and UDP-N-acetylglucosamine enolpyruvyl transferase (9). 5-Stranded αβ-propellers are composed of ββαβ repeats arranged in a circular fashion surrounding a channel in the centre of the structure (10).

INCREASING THE PROPORTION OF NOVEL STRUCTURES CLASSIFIED IN CATH

Recent analyses of CATH domain annotations in Gene3D (11) showed that between 80 and 90% of domain sequences in completely sequenced genomes can be assigned to a structural family in CATH. This suggests that the CATH database now provides a reasonably comprehensive structural view of the protein universe. Those protein families that have yet to be represented structurally are likely to be transmembrane or disordered proteins. The fact that most major folds are represented in CATH is reflected in the continual decrease in the proportion of non-redundant structures found to adopt novel folds. Table 2 gives the number of new folds identified per year from 1997 to 2007 and the percentage of non-redundant structures deposited that adopt a novel fold.

Table 2.

Numbers of structures classified in CATH and the proportion of novel folds per year

Year of PDB release	Number PDB structures classified in CATH	Number novel folds	Novel folds (%)
1997	1584	92	5.81
1998	1876	87	4.64
1999	2226	104	4.67
2000	2549	90	3.53
2001	2766	91	3.29
2002	2821	73	2.59
2003	3668	34	0.93
2004	3711	61	1.64
2005	3198	6	0.19
2006	3163	18	0.57
2007	2802	11	0.39
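The final column of Table 2 is the number of novel folds divided by the number of classified structures for that year, expressed as a percentage. The short sketch below recomputes it from the transcribed rows; the names used are illustrative assumptions.

```python
# Rows of Table 2: (year, structures classified, novel folds, reported %)
TABLE_2 = [
    (1997, 1584, 92, 5.81),
    (1998, 1876, 87, 4.64),
    (1999, 2226, 104, 4.67),
    (2000, 2549, 90, 3.53),
    (2001, 2766, 91, 3.29),
    (2002, 2821, 73, 2.59),
    (2003, 3668, 34, 0.93),
    (2004, 3711, 61, 1.64),
    (2005, 3198, 6, 0.19),
    (2006, 3163, 18, 0.57),
    (2007, 2802, 11, 0.39),
]


def novel_fold_percent(classified: int, novel: int) -> float:
    """Percentage of classified structures that adopt a novel fold."""
    return round(100.0 * novel / classified, 2)


if __name__ == "__main__":
    for year, classified, novel, reported in TABLE_2:
        computed = novel_fold_percent(classified, novel)
        print(f"{year}: computed {computed:.2f}%, reported {reported:.2f}%")
```

The recomputed values agree with the reported column to two decimal places, and they make the downward trend explicit: from nearly 6% in 1997 to well under 1% from 2005 onwards.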


IMPROVEMENTS TO THE STRUCTURAL COMPARISON METHOD CATHEDRAL

We have further improved our domain boundary prediction and fold assignment algorithm, CATHEDRAL (12), which is used to guide curators in the manual classification process. Whole PDB structures are scanned against a library of representatives from the CATH database to recognize constituent domains. CATHEDRAL initially performs rapid secondary structure comparison against the library to identify putative fold matches, which are then more accurately aligned at the residue level using dynamic programming. A support vector machine (SVM) is used to combine different measures of structural similarity and rank hits to the query structure. All domains predicted to be genuine hits by the SVM are assigned in an iterative fashion to identify constituent folds and domain boundaries from the residue-based structural alignments. Hits are allowed to overlap by up to 30 residues and conflicts are resolved by a new algorithm that moves along the overlapping region and assigns each residue to the closest domain.
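The overlap-resolution step can be illustrated with a small sketch. The text does not say how the 'closest' domain is measured, so this example assumes, purely for illustration, that each residue in the overlapping region is given to the domain whose centroid (computed from its unambiguous residues) is nearer in 3D space. The function names, the coordinate dictionary and the use of a single point per residue are all assumptions, and this is not the CATHEDRAL implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def centroid(points: Sequence[Coord]) -> Coord:
    """Arithmetic mean of a set of 3D points."""
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def resolve_overlap(
    coords: Dict[int, Coord],
    domain_a: List[int],
    domain_b: List[int],
    max_overlap: int = 30,  # overlap limit quoted in the text
) -> Tuple[List[int], List[int]]:
    """Assign every residue claimed by both domains to exactly one of them."""
    overlap = sorted(set(domain_a) & set(domain_b))
    if len(overlap) > max_overlap:
        raise ValueError("hits overlap by more than the permitted number of residues")
    # Residues claimed by only one hit define each domain's reference centroid.
    core_a = [r for r in domain_a if r not in overlap]
    core_b = [r for r in domain_b if r not in overlap]
    if not core_a or not core_b:
        raise ValueError("each domain needs at least one unambiguous residue")
    centre_a = centroid([coords[r] for r in core_a])
    centre_b = centroid([coords[r] for r in core_b])
    final_a, final_b = list(core_a), list(core_b)
    # Move along the overlapping region, assigning each residue in turn.
    for residue in overlap:
        if math.dist(coords[residue], centre_a) <= math.dist(coords[residue], centre_b):
            final_a.append(residue)
        else:
            final_b.append(residue)
    return sorted(final_a), sorted(final_b)
```

For a toy chain laid out along a line, with two hits covering residues 1 to 6 and 5 to 10, the two shared residues are split so that residue 5 joins the first domain and residue 6 the second.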

STRUCTURAL DIVERSITY AND THE VALIDITY OF THE CATH HIERARCHY

There has been much debate on the existence of a protein fold continuum and the validity of a hierarchical protein classification system (13–16). Greene et al. (17) previously explored the concept of ‘lateral links’ across the CATH hierarchy as a way of capturing structural relationships between superfamilies. More recent in-house analyses have shown that, within some of the most highly populated superfamilies, significant structural changes have occurred. Typically, the domains within a given superfamily possess a ‘common structural core’ comprising 40–50% of the residues in the structure, but there can be considerable structural embellishments to this core, and some domains can be up to three times larger than the typical representative of the family (18). In some cases, the embellishments are so considerable that the domain in question can be considered to exhibit a different fold from the other domains in the family.

Owing to the improvements made to CATHEDRAL (12), we have been able to perform a database-wide analysis of the similarities between all protein structures in the CATH database. This has been used to examine the extent to which superfamilies diverge structurally and to determine which superfamilies overlap with one another. Domains in each superfamily were first assigned to ‘structurally similar groups’ (SSGs): a domain is assigned to a particular SSG if it exhibits significant structural similarity to other domains in that group, that is, if it shares a normalized RMSD (SiMAX) structure comparison score of <5 Å with them (Cuff,A.L. et al., submitted for publication). Superfamilies with five or more SSGs were deemed to be structurally diverse. The majority of homologous superfamilies (∼96%) in the database are structurally conserved and structurally coherent, that is, they contain fewer than five SSGs and do not overlap with any other superfamily. However, the ∼4% of CATH superfamilies that do show considerable structural diversity are those that are the most highly populated in CATH, accounting for 40% of domain sequences in the genomes (Figure 2) (Cuff,A.L. et al., submitted for publication).
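The grouping of domains into SSGs can be sketched as a clustering over pairwise SiMAX scores. Two assumptions are made here and are not taken from the text: SiMAX is computed as the RMSD scaled by the larger domain length over the number of aligned residues, and domains are merged by single linkage (any pair scoring below the threshold ends up in the same group). The clustering procedure used for CATH may differ, and all names below are illustrative.

```python
from typing import Dict, List, Set, Tuple


def simax(rmsd: float, n_aligned: int, len_a: int, len_b: int) -> float:
    """Assumed form of a length-normalized RMSD (lower means more similar)."""
    return rmsd * max(len_a, len_b) / n_aligned


def group_into_ssgs(
    domains: List[str],
    scores: Dict[Tuple[str, str], float],
    threshold: float = 5.0,  # SiMAX cut-off in angstroms quoted in the text
) -> List[Set[str]]:
    """Single-linkage grouping: link every pair scoring below the threshold."""
    parent = {d: d for d in domains}

    def find(d: str) -> str:
        # Follow parent links to the group representative, compressing the path.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for (a, b), score in scores.items():
        if score < threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for d in domains:
        groups.setdefault(find(d), set()).add(d)
    return list(groups.values())


def is_structurally_diverse(ssgs: List[Set[str]], min_groups: int = 5) -> bool:
    """Superfamilies with five or more SSGs are deemed structurally diverse."""
    return len(ssgs) >= min_groups
```

With single linkage, two domains can share an SSG without scoring below 5 Å against each other directly, provided a chain of similar intermediates connects them. A stricter linkage would produce more, smaller groups.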

Figure 2.

Relationship between the degree of structural diversity (measured by the number of SSGs) and population of the superfamilies in the genomes (number of sequences).

If we consider the different SSGs to represent distinct ‘folds’ within these superfamilies, then instead of the 1110 ‘fold groups’ (defined by the Topology level in CATH version 3.2) there would be 3118 ‘fold groups’, and some superfamilies would have multiple ‘folds’. However, although examples of dramatic fold changes are known (19), they are rare, and the majority of gross structural changes that occur within a superfamily result from extensive structural embellishments to the common core rather than a dramatic change within the core. Therefore, the CATH hierarchical classification is not challenged if we consider a more appropriate definition of the T-level, or topology level, in CATH to be a grouping of structures sharing a common fold in the core of the domain. A file containing the number of SSGs in each superfamily in CATH can be downloaded from the CATH website (http://www.cathdb.info/download#version_v3.1).

We also investigated whether structures in different superfamilies were structurally similar (i.e. SiMAX <5 Å). We observed relatively little overlap between different superfamilies and fold groups at this threshold. As the threshold is increased, however, more overlaps occur between some architectures, such as the α up-down bundle, α-orthogonal bundle, β-sandwiches and αβ-sandwiches (Figure 3). This is largely due to the presence of small common super-secondary motifs, such as the α-hairpin, β-hairpin and αβ-motif. Superfamilies that exhibit no structural overlaps at all tend to have very distinctive folds, such as the β-trefoil fold, with unusual motifs or unusual combinations of common motifs.

Figure 3.

Plot showing the percentage of superfamilies that overlap (red) and show structural diversity, or drift (>5 SSGs) (blue) for different SiMAX cut-offs.

A structural overlap matrix of SiMAX scores created via the all-against-all CATHEDRAL analysis is downloadable from the CATH website (see http://www.cathdb.info/download#version_v3.1) so that users can perform their own analyses on a CATH-based protein structure universe.
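An analysis of the kind plotted in Figure 3 can be sketched once pairwise scores between superfamilies are available. The example below assumes the scores have already been reduced to the best (lowest) SiMAX between any two domains of each pair of superfamilies and are held in memory. The format of the downloadable matrix is not described here, so no file parsing is attempted, and the superfamily labels in the usage note are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def percent_overlapping(
    superfamilies: List[str],
    best_scores: Dict[Tuple[str, str], float],
    cutoff: float,
) -> float:
    """Percentage of superfamilies scoring below the cut-off against another one."""
    overlapping = set()
    for (sf_a, sf_b), score in best_scores.items():
        if sf_a != sf_b and score < cutoff:
            overlapping.update((sf_a, sf_b))
    return 100.0 * len(overlapping) / len(superfamilies)


def overlap_profile(
    superfamilies: List[str],
    best_scores: Dict[Tuple[str, str], float],
    cutoffs: Iterable[float],
) -> Dict[float, float]:
    """Overlap percentage at each cut-off, for plotting against the threshold."""
    return {c: percent_overlapping(superfamilies, best_scores, c) for c in cutoffs}


if __name__ == "__main__":
    families = ["sf_A", "sf_B", "sf_C", "sf_D"]  # hypothetical labels
    scores = {
        ("sf_A", "sf_B"): 4.0,
        ("sf_A", "sf_C"): 8.0,
        ("sf_B", "sf_C"): 12.0,
        ("sf_C", "sf_D"): 20.0,
    }
    print(overlap_profile(families, scores, [5.0, 10.0, 25.0]))
```

In this toy data, raising the cut-off from 5 to 25 Å moves the overlapping fraction from half of the superfamilies to all of them, which mirrors the qualitative behaviour described for looser thresholds.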

REDESIGNED WEBSITE

The CATH database can be accessed at http://www.cathdb.info. The web interface has been completely redesigned since version 3.1. Documentation, including an FAQ, tutorials, a glossary, downloadable data files and staff webpages, has also been created and is maintained through an open-source wiki software package. It will be updated frequently.

SUMMARY

In the light of our analyses on structural diversity in CATH, it is clear that the T-level provides a clustering of domain structures having similar folds in their domain cores. For each superfamily, information on the variety of different decorations to this common structural core is provided as distinct SSGs within the superfamily. Multiple structural alignments will shortly be provided for each SSG in order to highlight common secondary structures in the domain core and embellishments to this core.

FUNDING

Funding for open access charge: BBSRC. Conflict of interest statement. None declared.

18 in total

Review 1. Beta propellers: structural rigidity and functional diversity.

Authors: V Fülöp; D T Jones
Journal: Curr Opin Struct Biol Date: 1999-12 Impact factor: 6.809

Review 2. Fold change in evolution of protein structures.

Authors: N V Grishin
Journal: J Struct Biol Date: 2001 May-Jun Impact factor: 2.867

3. A 'periodic table' for protein structures.

Authors: William R Taylor
Journal: Nature Date: 2002-04-11 Impact factor: 49.962

4. Structural diversity of domain superfamilies in the CATH database.

Authors: Gabrielle A Reeves; Timothy J Dallman; Oliver C Redfern; Adrian Akpor; Christine A Orengo
Journal: J Mol Biol Date: 2006-06-02 Impact factor: 5.469

5. A folding space odyssey.

Authors: Alan R Davidson
Journal: Proc Natl Acad Sci U S A Date: 2008-02-19 Impact factor: 11.205

6. A discrete view on fold space.

Authors: Manfred J Sippl; Stefan J Suhrer; Markus Gruber; Markus Wiederstein
Journal: Bioinformatics Date: 2008-01-24 Impact factor: 6.937

7. Crystal structure of RNA 3'-terminal phosphate cyclase, a ubiquitous enzyme with unusual topology.

Authors: G J Palm; E Billy; W Filipowicz; A Wlodawer
Journal: Structure Date: 2000-01-15 Impact factor: 5.006

8. The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution.

Authors: Lesley H Greene; Tony E Lewis; Sarah Addou; Alison Cuff; Tim Dallman; Mark Dibley; Oliver Redfern; Frances Pearl; Rekha Nambudiry; Adam Reid; Ian Sillitoe; Corin Yeats; Janet M Thornton; Christine A Orengo
Journal: Nucleic Acids Res Date: 2006-11-29 Impact factor: 16.971

9. Gene3D: comprehensive structural and functional annotation of genomes.

Authors: Corin Yeats; Jonathan Lees; Adam Reid; Paul Kellam; Nigel Martin; Xinhui Liu; Christine Orengo
Journal: Nucleic Acids Res Date: 2007-11-21 Impact factor: 16.971

10. CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures.

Authors: Oliver C Redfern; Andrew Harrison; Tim Dallman; Frances M G Pearl; Christine A Orengo
Journal: PLoS Comput Biol Date: 2007-11 Impact factor: 4.475
