Ian Sillitoe, Natalie Dawson, Tony E Lewis, Sayoni Das, Jonathan G Lees, Paul Ashford, Adeyelu Tolulope, Harry M Scholes, Ilya Senatorov, Andra Bujan, Fatima Ceballos Rodriguez-Conde, Benjamin Dowling, Janet Thornton, Christine A Orengo.
Abstract
This article provides an update of the latest data and developments within the CATH protein structure classification database (http://www.cathdb.info). The resource provides two levels of release: CATH-B, a daily snapshot of the latest structural domain boundaries and superfamily assignments, and CATH+, which adds layers of derived data, such as predicted sequence domains, functional annotations and functional clustering (known as Functional Families or FunFams). The most recent CATH+ release (version 4.2) substantially expands the coverage of structural data. This release increases the number of fully classified domains by over 40% (from 308 999 to 434 857 structural domains), corresponding to an almost two-fold increase in sequence data (from 53 million to over 95 million predicted domains) organised into 6119 superfamilies. The proportion of high-resolution protein PDB chains that contain at least one assigned CATH domain is now 90.2% (increased from 82.3% in the previous release). A number of highly requested features have also been implemented in our web pages: users can now view an alignment between their query sequence and a representative FunFam structure, and new tools make it easier to view the full structural context (multi-domain architecture) of domains and chains.
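The growth figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `growth` is hypothetical, and the predicted-domain count for v4.2 is taken as exactly 95 million, whereas the abstract says "over 95 million", so the computed fold change is a lower bound.

```python
def growth(old, new):
    """Return (fold_change, percent_increase) going from old to new."""
    fold = new / old
    return fold, (fold - 1.0) * 100.0


# Counts quoted in the abstract for CATH+ v4.1 -> v4.2.
structural_fold, structural_pct = growth(308_999, 434_857)

# Assumption: "over 95 million" is treated as exactly 95 million (lower bound).
predicted_fold, predicted_pct = growth(53e6, 95e6)

# Change in PDB chain coverage, in percentage points.
coverage_gain = 90.2 - 82.3

print(f"structural domains: +{structural_pct:.1f}%")
print(f"predicted domains: {predicted_fold:.2f}-fold (lower bound)")
print(f"chain coverage: +{coverage_gain:.1f} percentage points")
```

The structural-domain increase comes out just above 40%, and the predicted-domain ratio falls a little short of 2, consistent with the abstract's "over 40%" and "almost two-fold".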
Year: 2019 PMID: 30398663 PMCID: PMC6323983 DOI: 10.1093/nar/gky1097
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Comparison of the structural domains and predicted (sequence) domains between CATH+ releases 4.1 and 4.2.
Figure 2. Superfamilies in CATH v4.1 and v4.2, highlighting the number of structural domains and predicted domains (shown with a linear and logarithmic scale). Each dot represents a superfamily, and the largest 100 superfamilies (according to number of predicted domains in CATH v4.1) have been highlighted in red. These superfamilies contain more than half of all known protein domains. To help illustrate the growth of data, an example superfamily (3.60.20.10) has been circled in each plot. The number of structural domains in this superfamily increased 2.7-fold from v4.1 to v4.2 (3074 to 8328 domains); however, this corresponded to only a small increase in the number of predicted domains. On investigation, this superfamily contains domains from a number of large proteasomes, which contain many copies of identical (or very similar) structural domains.
Figure 3. Screenshot showing a query sequence aligned to a matching CATH FunFam (following a sequence search). Sequence conservation is calculated for each position in the alignment (blue is low conservation, red is high conservation) and these colours are mapped to a representative structure (if one is available).
Figure 4. Interactive links between 3D structure (3DMol.js) and multi-domain architecture (MDA) allow the user to view each domain in the context of the full chain.