Literature DB >> 30398663

CATH: expanding the horizons of structure-based functional annotations for genome sequences.

Ian Sillitoe¹, Natalie Dawson¹, Tony E Lewis¹, Sayoni Das¹, Jonathan G Lees¹, Paul Ashford¹, Adeyelu Tolulope¹, Harry M Scholes¹, Ilya Senatorov¹, Andra Bujan¹, Fatima Ceballos Rodriguez-Conde¹, Benjamin Dowling¹, Janet Thornton², Christine A Orengo¹.

Abstract

This article provides an update of the latest data and developments within the CATH protein structure classification database (http://www.cathdb.info). The resource provides two levels of release: CATH-B, a daily snapshot of the latest structural domain boundaries and superfamily assignments, and CATH+, which adds layers of derived data, such as predicted sequence domains, functional annotations and functional clustering (known as Functional Families or FunFams). The most recent CATH+ release (version 4.2) provides a huge update in the coverage of structural data. This release increases the number of fully- classified domains by over 40% (from 308 999 to 434 857 structural domains), corresponding to an almost two- fold increase in sequence data (from 53 million to over 95 million predicted domains) organised into 6119 superfamilies. The coverage of high-resolution, protein PDB chains that contain at least one assigned CATH domain is now 90.2% (increased from 82.3% in the previous release). A number of highly requested features have also been implemented in our web pages: allowing the user to view an alignment between their query sequence and a representative FunFam structure and providing tools that make it easier to view the full structural context (multi-domain architecture) of domains and chains.

Entities: Chemical Gene

Year: 2019 PMID： 30398663 PMCID： PMC6323983 DOI： 10.1093/nar/gky1097

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

CATH is an online, publicly accessible resource that provides a classification of protein domains based on structural data deposited into the worldwide Protein Data Bank (wwPDB) (1). The resource splits 3D structures into constituent domains (considered as semi-independently folding globular units) and clusters these domains into homologous superfamilies where there is sufficient evidence of evolutionary similarity (2). CATH uses a number of sensitive structure-comparison and sequence comparison tools (including SSAP (3), HMMER3 hmmer.org, PRC (4)) to assist the manual curation of these remote evolutionary relationships. Homologous superfamilies are organised into a hierarchy of Class, Architecture and Topology according to an increasing degree of structural similarity. Knowledge of 3D protein structure is hugely valuable when investigating remote evolutionary relationships and detailed functional mechanisms, however the number of solved 3D structures is relatively small compared to the number of protein sequences. In order to harness the information available in the sequence databases, CATH uses a set of representative structural domains to ‘seed’ a set of sequence alignments (HMMER, hmmer.org). These alignments are converted into hidden Markov models (HMMs) and the resulting library of sequence-based ‘fingerprints’ is used to identify the location of closely related domains within protein sequences taken from the genome databases UniProtKB (5) and Ensembl (6). By combining protein structure and sequence, the CATH resource provides comprehensive structure-based domain superfamily assignments to over 95 million protein sequences (7). In many cases, these annotated proteins are already associated with a considerable amount of additional functional annotation (e.g. UniProtKB provides annotations such Gene Ontology (GO) (8), Enzyme Commission (EC) terms (9) and Catalytic Site Atlas (10)). The Homologous Superfamily (H) level is useful for investigating distant evolutionary relationships as it groups domains that have the same structural core but may have structural embellishments to that core that result in diverse functions. Therefore, to investigate more finely tuned features, such as the evolution of functional sites, superfamilies are also broken into smaller, more focussed groups of domains called Functional Families (FunFams). FunFams aim to group domain sequences that perform a similar function and are created via a functional classification protocol based on distinguishing sequence patterns (11,12). The resulting clusters of domains have been found to be more functionally coherent than other domain-based resources (11). While this is a non-trivial task (13–16), it is extremely useful when attempting to correlate functional annotations with sequence patterns (11). Function prediction pipelines based on the FunFams have been consistently ranked among the top 10 function prediction methods by the international CAFA competition (17,18). Since structures within FunFams tend to be highly conserved (even at low sequence identity), these clusters have also been found to be useful when suggesting templates for 3D modelling (19). The resource allows users to search for matching FunFams by submitting a query protein sequence to the web pages (cathdb.info/search/by_sequence), or through the API (20). Finding a significant match to a FunFam can provide specific functional information (eg by mapping highly conserved positions onto the query sequence) and provide more general annotations such as most likely GO term annotations. In 2017, CATH was named as a Core Data Resource (CDR) within ELIXIR, one of only 18 CDRs across Europe (and one of only nine outside of the EBI) (21).

CATH v4.2 RELEASE HIGHLIGHTS

The most recent CATH+ release, version 4.2 (based on PDB as of July 2017), brings a huge number of annotations (125 858 newly classified domain from 24 765 newly processed protein structures from the wwPDB); an increase of 40% since CATH+ release 4.1 (based on PDB as of January 2015) (Figure 1). This was made possible through the implementation of a machine learning algorithm that allows annotations to be automatically inherited when sequence and structure comparisons to existing CATH domains met a carefully benchmarked set of criteria (22). Extensive manual curation enabled the recognition of new folds, and new superfamilies within known folds.

Figure 1.

Comparison of the structural domains and predicted (sequence) domains between CATH+ releases 4.1 and 4.2.

Comparison of the structural domains and predicted (sequence) domains between CATH+ releases 4.1 and 4.2. The corresponding sequence data for this release added an extra 45 million predicted domain sequences. Fitting these additional domain sequences into the alignments for each FunFam has increased the average information content in these alignments by 13% (where information content is calculated by DOPS score, a measure of the global similarity of sequences within an alignment (23)). This release also sees the number of superfamilies in CATH rise from 2737 (version 4.1) to 6119 (version 4.2) (Figure 2), though this only corresponds to a relatively small number of new folds (18) and one new architecture. Almost two-thirds of these new superfamilies (62%) correspond to superfamilies that contain only one representative structural domain (non-redundant at 95% sequence identity). In these cases, it is often possible to detect structural and sequence similarities to existing superfamilies however the evidence does not yet reach the required threshold for these new members to be merged. It is important to note that this evidence is being updated constantly as new sequences, structures and functional annotations are added to public databases so these superfamilies may be merged in future releases. As in previous releases, we offer strictly non-redundant datasets of CATH domains in which it is guaranteed that no two domains have more than 20%/40% sequence identity (at 60% overlap). If groups are interested in using CATH domains for benchmarking or training their own algorithms, it is strongly recommended that they take advantage of these datasets (see Availability).

Figure 2.

Superfamilies in CATH v4.1 and v4.2 highlighting the number of structural domains and predicted domains (shown with a linear and logarithmic scale). Each dot represents a superfamily and the largest 100 superfamilies (according to number of predicted domains in CATH v4.1) have been highlighted in red. These superfamilies contain more than half of all known protein domains. To help illustrate the growth of data, an example superfamily (3.60.20.10) has been circled in each plot. The number of structural domains in this superfamily increased 2.7-fold from v4.1 to v4.2 (3074 to 8328 domains), however this corresponded to only a small increase in the number of predicted domains. On investigation, this superfamily contains domains from a number of large proteasomes, which contain many copies of identical (or very similar) structural domains. CATH+ adds many layers of information on top of the classification of domain boundaries and Homologous superfamily, and the process of creating a CATH+ release takes approximately 12–18 months. The first step in the process of generating a CATH+ release is to ‘freeze’ the incoming data from the wwPDB. While this is important to ensure consistency when calculating additional layers of information, it also means that the latest PDB structures may be missing from a CATH+ release. To address this, we offer a daily snapshot containing the very latest domain boundary and superfamily annotations (CATH-B). At the time of writing (Sep, 2018), CATH-B contained a further 46 906 structural domains since CATH+ 4.2 was ‘frozen’ (July 2017).

ADDITIONAL FEATURES IN CATH

Improved superfamily names and descriptions

Since the last release, there has been considerable effort to review existing superfamily names and provide names for unnamed superfamilies. Superfamilies in CATH were mapped to a number of related resources (InterPro, Pfam, SCOP, UniProtKB) as a means to provide evidence for names that were unique and informative while still representing the potentially diverse members in each cluster. This manual curation work resulted in 1,255 superfamily names being reviewed and standardised with respect to equivalent clusters in InterPro. It also generated names and detailed descriptions for an additional 1215 CATH superfamilies that had not yet been annotated.

Alignment of query sequences against FunFams

The FunFHMMER server is a web tool that allows users to search a query protein sequence against a library of FunFam HMMs (11). A search typically takes a few seconds and then the user is provided with a list of significant FunFam matches with a visual indication of the location of each match on the original query sequence. In previous releases, the user was able to click through to web pages describing these FunFams in more detail (e.g. to view a multiple sequence alignment with highly conserved positions highlighted on a representative 3D structure), however they were not able to see their own sequence in that alignment. A highly requested feature has been to provide users with the opportunity to view their own query sequence aligned to this structure. This feature has been added and the query sequence is automatically inserted as the first sequence in the FunFam alignment when following links from the search page (Figure 3).

Figure 3.

Screenshot showing a query sequence aligned to a matching CATH FunFam (following a sequence search). Sequence conservation is calculated for each position in the alignment (blue is low conservation, red is high conservation) and these colours are mapped to a representative structure (if one is available).

Visualising structural context of domains

While it is important to consider CATH domains as semi-independent globular units, it is also important to consider the context in which each domain appears in the native protein, ie the multi-domain architecture. This is crucial when considering functional annotations (such as GO terms) since these are made at the protein level, rather than the domain level. To highlight the structural context of domains within the full protein, a new interactive component has been added to the pages that combines a 3D structural viewer (3DMol.js (24)) with a simple 1D schematic of the multi-domain architecture (MDA) of the protein chain (Figure 4). This enables users to view the domain within the context of the PDB chain (‘View Domain in Chain’) or in the context of the full PDB (‘View Domain in PDB’). In these modes, clicking a domain in the MDA adds focus to the domain in the 3D viewer while ghosting the surrounding domain partners.

Figure 4.

Interactive links between 3D structure (3DMol.js) and multi-domain architecture (MDA) allow the user to view each domain in the context of the full chain.

CONCLUSION

This article describes the latest release of CATH version 4.2 that contains over 430,000 structural domains and over 95 million predicted domain sequences. This represents a huge increase in coverage both in terms of coverage of the PDB and of sequences in UniProtKB. A number of highly requested features have also been implemented in our web pages: allowing the user to view an alignment between their query sequence and a representative FunFam structure and providing tools that make it easier to view the full structural context (multi-domain architecture) of domains and chains.

DATA AVAILABILITY

Daily snapshot of latest annotations (CATH-B) is available to download at ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/daily-release. CATH+ data is available to download at ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/all-releases/v4_2_0/ and can be browsed through the web site at http://www.cathdb.info Datasets of non-redundant domains are available to download at: ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/latest-release/non-redundant-data-sets/

22 in total

Review 1. Scoring residue conservation.

Authors: William S J Valdar
Journal: Proteins Date: 2002-08-01

2. webPRC: the Profile Comparer for alignment-based searching of public domain databases.

Authors: Bernd W Brandt; Jaap Heringa
Journal: Nucleic Acids Res Date: 2009-05-06 Impact factor: 16.971

3. New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures.

Authors: Ian Sillitoe; Alison L Cuff; Benoit H Dessailly; Natalie L Dawson; Nicholas Furnham; David Lee; Jonathan G Lees; Tony E Lewis; Romain A Studer; Robert Rentzsch; Corin Yeats; Janet M Thornton; Christine A Orengo
Journal: Nucleic Acids Res Date: 2012-11-29 Impact factor: 16.971

4. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data.

Authors: Helen Berman; Kim Henrick; Haruki Nakamura; John L Markley
Journal: Nucleic Acids Res Date: 2006-11-16 Impact factor: 16.971

5. Functional classification of CATH superfamilies: a domain-based approach for protein function annotation.

Authors: Sayoni Das; David Lee; Ian Sillitoe; Natalie L Dawson; Jonathan G Lees; Christine A Orengo
Journal: Bioinformatics Date: 2015-07-02 Impact factor: 6.937

6. 3Dmol.js: molecular visualization with WebGL.

Authors: Nicholas Rego; David Koes
Journal: Bioinformatics Date: 2014-12-12 Impact factor: 6.937

7. A large-scale evaluation of computational protein function prediction.

Authors: Predrag Radivojac; Wyatt T Clark; Tal Ronnen Oron; Alexandra M Schnoes; Tobias Wittkop; Artem Sokolov; Kiley Graim; Christopher Funk; Karin Verspoor; Asa Ben-Hur; Gaurav Pandey; Jeffrey M Yunes; Ameet S Talwalkar; Susanna Repo; Michael L Souza; Damiano Piovesan; Rita Casadio; Zheng Wang; Jianlin Cheng; Hai Fang; Julian Gough; Patrik Koskinen; Petri Törönen; Jussi Nokso-Koivisto; Liisa Holm; Domenico Cozzetto; Daniel W A Buchan; Kevin Bryson; David T Jones; Bhakti Limaye; Harshal Inamdar; Avik Datta; Sunitha K Manjari; Rajendra Joshi; Meghana Chitale; Daisuke Kihara; Andreas M Lisewski; Serkan Erdin; Eric Venner; Olivier Lichtarge; Robert Rentzsch; Haixuan Yang; Alfonso E Romero; Prajwal Bhat; Alberto Paccanaro; Tobias Hamp; Rebecca Kaßner; Stefan Seemayer; Esmeralda Vicedo; Christian Schaefer; Dominik Achten; Florian Auer; Ariane Boehm; Tatjana Braun; Maximilian Hecht; Mark Heron; Peter Hönigschmid; Thomas A Hopf; Stefanie Kaufmann; Michael Kiening; Denis Krompass; Cedric Landerer; Yannick Mahlich; Manfred Roos; Jari Björne; Tapio Salakoski; Andrew Wong; Hagit Shatkay; Fanny Gatzmann; Ingolf Sommer; Mark N Wass; Michael J E Sternberg; Nives Škunca; Fran Supek; Matko Bošnjak; Panče Panov; Sašo Džeroski; Tomislav Šmuc; Yiannis A I Kourmpetis; Aalt D J van Dijk; Cajo J F ter Braak; Yuanpeng Zhou; Qingtian Gong; Xinran Dong; Weidong Tian; Marco Falda; Paolo Fontana; Enrico Lavezzo; Barbara Di Camillo; Stefano Toppo; Liang Lan; Nemanja Djuric; Yuhong Guo; Slobodan Vucetic; Amos Bairoch; Michal Linial; Patricia C Babbitt; Steven E Brenner; Christine Orengo; Burkhard Rost; Sean D Mooney; Iddo Friedberg
Journal: Nat Methods Date: 2013-01-27 Impact factor: 28.547

8. Functional site plasticity in domain superfamilies.

Authors: Benoit H Dessailly; Natalie L Dawson; Kenji Mizuguchi; Christine A Orengo
Journal: Biochim Biophys Acta Date: 2013-03-14

Review 9. The evolution of enzyme function in the isomerases.

Authors: Sergio Martinez Cuesta; Nicholas Furnham; Syed Asad Rahman; Ian Sillitoe; Janet M Thornton
Journal: Curr Opin Struct Biol Date: 2014-07-05 Impact factor: 6.809

Review 10. New insights about enzyme evolution from large scale studies of sequence and structure relationships.

Authors: Shoshana D Brown; Patricia C Babbitt
Journal: J Biol Chem Date: 2014-09-10 Impact factor: 5.157

44 in total

1. The IMG/M data management and analysis system v.6.0: new tools and advanced capabilities.

Authors: I-Min A Chen; Ken Chu; Krishnaveni Palaniappan; Anna Ratner; Jinghua Huang; Marcel Huntemann; Patrick Hajek; Stephan Ritter; Neha Varghese; Rekha Seshadri; Simon Roux; Tanja Woyke; Emiley A Eloe-Fadrosh; Natalia N Ivanova; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971

CATH: expanding the horizons of structure-based functional annotations for genome sequences.

INTRODUCTION

CATH v4.2 RELEASE HIGHLIGHTS

ADDITIONAL FEATURES IN CATH

Improved superfamily names and descriptions

Alignment of query sequences against FunFams

Visualising structural context of domains

CONCLUSION

DATA AVAILABILITY

Review 1. Scoring residue conservation.

2. webPRC: the Profile Comparer for alignment-based searching of public domain databases.

3. New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures.

4. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data.

5. Functional classification of CATH superfamilies: a domain-based approach for protein function annotation.

6. 3Dmol.js: molecular visualization with WebGL.

7. A large-scale evaluation of computational protein function prediction.

8. Functional site plasticity in domain superfamilies.

Review 9. The evolution of enzyme function in the isomerases.

Review 10. New insights about enzyme evolution from large scale studies of sequence and structure relationships.

1. The IMG/M data management and analysis system v.6.0: new tools and advanced capabilities.

Review 2. Adaptability and specificity: how do proteins balance opposing needs to achieve function?

3. β-Strand-mediated interactions of protein domains.

Review 4. A guide to machine learning for biologists.

5. Bioinformatics for the Origin and Evolution of Viruses.

6. Three-dimensional Structure Databases of Biological Macromolecules.

Review 7. A global analysis of function and conservation of catalytic residues in enzymes.

8. SidechainNet: An all-atom protein structure dataset for machine learning.

9. Structure of the activated ROQ1 resistosome directly recognizing the pathogen effector XopQ.

Review 10. Evolution, folding, and design of TIM barrels and related proteins.