Gergely Csaba, Fabian Birzele, Ralf Zimmer.
Abstract
BACKGROUND: SCOP and CATH are widely used as gold standards to benchmark novel protein structure comparison methods and to train machine learning approaches for protein structure classification and prediction. The two hierarchies are built with different protocols, which can lead to differing classifications of the same protein. Ignoring such differences causes problems when the hierarchies are used to train or benchmark automatic structure classification methods. Here, we propose a method to compare SCOP and CATH in detail and discuss possible applications of this analysis.
Year: 2009 PMID: 19374763 PMCID: PMC2678134 DOI: 10.1186/1472-6807-9-23
Source DB: PubMed Journal: BMC Struct Biol ISSN: 1472-6807
Mapping of the domain definitions of the two hierarchies
| | 1 | 2 | 3 | 4 | 5 | 6 |
| SCOP | 49'251 | 17'162 | 1'885 | 435 | 130 | 29 |
| CATH | 68'270 | 11'018 | 492 | 3 | 0 | 0 |
Mapping of the domain definitions of the two hierarchies. An overlap threshold > 0 is used, i.e. all domains that share at least one residue are mapped onto each other. The column headers give the number of partner domains: the SCOP row counts SCOP domains by the number of CATH domains mapped onto them, while the CATH row counts CATH domains by the number of SCOP domains mapped onto them. A single domain in SCOP may be partitioned into up to 6 domains in CATH. Overall, 19'641 of the 68'892 SCOP domains map to more than one CATH domain, while 11'513 of the 79'783 CATH domains map to more than one SCOP domain.
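The residue-overlap mapping described in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the input layout (domain id mapped to a set of `(chain, residue number)` tuples), the function names `map_domains` and `partner_histogram`, and the toy data are all assumptions made for this example.

```python
from collections import Counter


def map_domains(source, target):
    """Map each source domain to every target domain sharing >= 1 residue.

    Both arguments are dicts: domain id -> set of (chain, residue number).
    This layout is an assumption made for the sketch.
    """
    return {
        s_id: [t_id for t_id, t_res in target.items() if s_res & t_res]
        for s_id, s_res in source.items()
    }


def partner_histogram(mapping):
    """Count how many domains have 1, 2, 3, ... partners (unmapped ones are skipped)."""
    return Counter(len(partners) for partners in mapping.values() if partners)


# Toy data, invented for illustration only.
scop = {
    "s1": {("A", i) for i in range(1, 101)},
    "s2": {("A", i) for i in range(101, 201)},
}
cath = {
    "c1": {("A", i) for i in range(1, 61)},
    "c2": {("A", i) for i in range(61, 121)},
    "c3": {("A", i) for i in range(121, 201)},
}

scop_to_cath = map_domains(scop, cath)           # one table row per direction
scop_row = partner_histogram(scop_to_cath)       # SCOP domains by number of CATH partners
cath_row = partner_histogram(map_domains(cath, scop))
```

The two histograms correspond to the SCOP and CATH rows of the table. As a check on the table itself, the SCOP row sums to 68'892 and its columns 2 to 6 sum to 19'641, which matches the totals quoted in the caption.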
Mapping distribution of SCOP onto CATH nodes
| F > 0 | Unmapped | C | A | T | H |
| fold class | 0 | 1 | 4 | ||
| Fold | 0 | 0 | 5 | 504 | 236 |
| superfamily | 0 | 0 | 2 | 32 | |
| Family | 0 | 0 | 1 | 9 | |
| F > 0.8 | Unmapped | C | A | T | H |
| fold class | 8 | 0 | 1 | ||
| Fold | 125 | 0 | 4 | 439 | 177 |
| superfamily | 236 | 0 | 1 | 24 | |
| Family | 1'055 | 0 | 1 | 6 | |
Number of inner nodes from each SCOP hierarchy level that map best to a node from a given CATH level. Consistent mappings are displayed in bold. Two F-measure thresholds, 0 and 0.8, are shown. For example, 504 SCOP folds map best to a CATH topology node at a threshold of 0; this drops to 439 nodes at an F-measure threshold of 0.8.
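A best-match mapping of this kind can be sketched as below. The caption does not define the F-measure, so this example assumes the standard harmonic mean of precision and recall computed over the sets of domains under each node. The function names, node ids and memberships are hypothetical.

```python
def f_measure(scop_members, cath_members):
    """Harmonic mean of precision and recall between two sets of domain ids.

    Assumed definition; the exact formula used in the study is not given here.
    """
    shared = len(scop_members & cath_members)
    if shared == 0:
        return 0.0
    precision = shared / len(cath_members)
    recall = shared / len(scop_members)
    return 2 * precision * recall / (precision + recall)


def best_match(scop_members, cath_nodes, threshold=0.0):
    """Return (node id, F) of the best-scoring CATH node, or None if F <= threshold."""
    best_id, best_f = None, 0.0
    for node_id, members in cath_nodes.items():
        f = f_measure(scop_members, members)
        if f > best_f:
            best_id, best_f = node_id, f
    if best_id is None or best_f <= threshold:
        return None  # counted as "Unmapped"
    return best_id, best_f


# Toy nodes, invented for illustration only.
scop_fold = {"d1", "d2", "d3", "d4"}
cath_nodes = {
    "T:toy": {"d1", "d2", "d3"},
    "H:toy": {"d1", "d2"},
    "A:toy": {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"},
}

match = best_match(scop_fold, cath_nodes, threshold=0.0)
strict = best_match(scop_fold, cath_nodes, threshold=0.9)
```

In the toy data the fold maps best to the topology-level node. Raising the threshold above that node's score leaves the fold unmapped, which is the effect behind the growth of the "Unmapped" column between the two halves of the table.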
Inconsistencies between SCOP and CATH
| consistent | inconsistent | folds | superfamilies | |
| family | 133'335 | 70 | 102 | |
| superfamily | 713'181 | 121 | 159 | |
| fold | 2'389'191 | 84 | 500 | |
| class | 62'849'692 | 745 | 1'258 | |
| other class | 249'897'353 | 745 | 1'258 |
Shows the inconsistencies between SCOP and CATH with respect to the levels of the SCOP hierarchy. The second column displays the number of consistent pairs (pairs of proteins from folds, superfamilies and families in the cells marked bold in Table 4) and the third column the number of inconsistent pairs. Columns four and five display the number of distinct folds and superfamilies which account for the inconsistencies observed.
Detailed mappings of domain pairs in percent from SCOP onto CATH
| outer | class | fold | superfamily | family | |
| outer | 8.31% | 0.99% | 0.40% | 0.03% | |
| class | 18.16% | 2.55% | 1.88% | 0.87% | |
| arch | 2.42% | 2.80% | 1.27% | 0.09% | |
| top | 0.04% | 10.50% | 4.44% | 0.66% | |
| hom | 0.002% | 0.14% | 11.66% |
Displays the detailed mappings of domain pairs in percent from SCOP (columns) onto CATH (rows). Columns sum up to 100% and table cells marked in bold display consistent mappings. Please note that due to the very large number of pairs, even small percentage values correspond to many examples (see Table 3 for details).
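A column-normalized pair table of this shape can be derived as in the following sketch. It assumes each domain carries a four-level SCOP path (class, fold, superfamily, family) and a four-level CATH path (C, A, T, H), and that a pair is assigned to the deepest level at which the two paths still agree. The function names and the toy classification codes are invented for illustration.

```python
from collections import Counter
from itertools import combinations

SCOP_LEVELS = ["outer", "class", "fold", "superfamily", "family"]
CATH_LEVELS = ["outer", "class", "arch", "top", "hom"]


def deepest_shared_level(path_a, path_b, levels):
    """Name of the deepest level at which two classification paths still agree."""
    depth = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        depth += 1
    return levels[depth]  # depth 0 means the pair shares nothing ("outer")


def pair_table(domains):
    """domains: id -> (scop_path, cath_path). Returns {scop_level: {cath_level: percent}}."""
    counts = Counter()
    for a, b in combinations(domains, 2):
        s = deepest_shared_level(domains[a][0], domains[b][0], SCOP_LEVELS)
        c = deepest_shared_level(domains[a][1], domains[b][1], CATH_LEVELS)
        counts[(s, c)] += 1
    table = {}
    for s in SCOP_LEVELS:
        total = sum(counts[(s, c)] for c in CATH_LEVELS)
        if total:  # each SCOP column sums to 100%
            table[s] = {c: 100.0 * counts[(s, c)] / total for c in CATH_LEVELS}
    return table


# Toy classification paths, invented for illustration only.
domains = {
    "d1": (("x", "x.1", "x.1.1", "x.1.1.1"), ("1", "1.1", "1.1.1", "1.1.1.1")),
    "d2": (("x", "x.1", "x.1.1", "x.1.1.1"), ("1", "1.1", "1.1.1", "1.1.1.1")),
    "d3": (("x", "x.2", "x.2.1", "x.2.1.1"), ("1", "1.1", "1.1.1", "1.1.1.2")),
    "d4": (("y", "y.1", "y.1.1", "y.1.1.1"), ("2", "2.1", "2.1.1", "2.1.1.1")),
}
table = pair_table(domains)
```

In the toy data, `d1` and `d3` lie in different SCOP folds but share a CATH topology, so they land in the class column, top row, which is the kind of off-diagonal cell the table reports.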
Figure 1. Detailed comparison of protein structure benchmark sets. The figure compares the performance of TM-align on the complete set of similarity relationships defined by SCOP (left column) with its performance on the novel SCOP-CATH consensus benchmark set proposed in this study (right column). The TM-align performance is visualized in several plots that show the classification errors in some detail. Panels (a) and (b) show the distribution of scores for the various levels of the classifications. Although the fold scores are somewhat shifted to the right, the score distributions overlap significantly, so no clear thresholds for safe classification of structure pairs can be set. Panels (c)-(f) compare the various errors for the comprehensive and consensus benchmark sets. As errors we count wrong domains scored better than correct domains. The errors are significantly reduced on the consensus set, panels (d) and (f). Finally, in panels (g)-(h) the errors (number of wrong folds scored better than certain correct folds) are summarized as boxplots. Again, fewer errors are observed in the consensus set: for the best-scoring correct domains, few wrong folds score better in either set, whereas for correct members with low scores, many wrong folds score better. See main text for a more detailed description. Overall, the number of errors is reduced disproportionately (about 50% error reduction) compared to the reduction of pairs in the consensus benchmark (about 16% pairs reduction).
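The error count used in this figure (wrong domains scored better than correct domains) can be sketched as below. The sketch assumes that a higher score means greater structural similarity and that each hit of a query is already labelled correct or wrong by the benchmark. The function name and the scores are hypothetical and are not TM-align output.

```python
from bisect import bisect_right


def errors_per_correct_hit(hits):
    """hits: list of (score, is_correct) for one query; higher score = more similar.

    Returns (score, number of wrong hits with a strictly higher score)
    for each correct hit, best-scored first.
    """
    wrong_scores = sorted(score for score, ok in hits if not ok)
    result = []
    for score, ok in sorted(hits, key=lambda h: h[0], reverse=True):
        if ok:
            # Wrong hits that outrank this correct hit.
            n_better = len(wrong_scores) - bisect_right(wrong_scores, score)
            result.append((score, n_better))
    return result


# Toy scores, invented for illustration only.
hits = [(0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
errors = errors_per_correct_hit(hits)
total_errors = sum(n for _, n in errors)
```

The per-hit counts grow as the score of the correct hit drops, which is the pattern the boxplots in panels (g)-(h) summarize.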
Figure 2. Linking different folds via consistency checks. a) Shows the method of connecting different folds in, e.g., SCOP via a link proposed by the mapping of SCOP and CATH. Nodes in the graph represent SCOP folds; edges connect two nodes iff at least 5 members of the SCOP fold are mapped to the same CATH topology. b) Shows the interfold similarity of α-hairpin proteins in SCOP, which are clustered in the same fold according to CATH (1.10.287). c) Shows a more complicated fold graph clustering proteins of immunoglobulin (CATH 2.60.40) and jelly-roll topologies (CATH 2.60.120) in a non-clique subgraph. All fold graphs may be interactively explored on .
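The fold graph of panel (a) can be sketched as follows. The caption leaves open whether the 5-member minimum applies to each fold or to the pair as a whole. This example assumes that both folds must each contribute at least 5 members to the shared CATH topology. The function name, fold labels and topology labels are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def fold_graph(domains, min_members=5):
    """domains: iterable of (scop_fold, cath_topology), one entry per mapped domain.

    Returns a set of undirected edges (fold_a, fold_b) with fold_a < fold_b.
    Assumption: each fold needs >= min_members domains in the shared topology.
    """
    counts = defaultdict(lambda: defaultdict(int))  # topology -> fold -> members
    for fold, topology in domains:
        counts[topology][fold] += 1
    edges = set()
    for per_fold in counts.values():
        strong = sorted(f for f, n in per_fold.items() if n >= min_members)
        edges.update(combinations(strong, 2))
    return edges


# Toy mapping, invented for illustration only.
domains = (
    [("foldA", "top1")] * 6
    + [("foldB", "top1")] * 5
    + [("foldC", "top1")] * 2   # below the minimum, so no link via top1
    + [("foldB", "top2")] * 5
    + [("foldC", "top2")] * 7
)
edges = fold_graph(domains)
```

The toy result links foldA to foldB and foldB to foldC but not foldA to foldC, a small instance of the non-clique subgraph mentioned for panel (c).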