Linus Scheibenreif, Maria Littmann, Christine Orengo, Burkhard Rost.
Abstract
BACKGROUND: The CATH database provides a hierarchical classification of protein domain structures, including a sub-classification of superfamilies into functional families (FunFams). We analyzed the similarity of binding site annotations in these FunFams and incorporated FunFams into the prediction of protein binding residues.
Keywords: Binding residue prediction; CATH; Functional families; Protein binding sites; Protein families; Protein function
Year: 2019 PMID: 31319797 PMCID: PMC6639920 DOI: 10.1186/s12859-019-2988-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Concept of using FunFams to filter binding residue predictions, shown for the protein glutathione S-transferase (identifier 1U3I [17, 18]) binding glutathione. The binding residues are shown on the structure using PyMOL [19]. Correctly predicted binding residues (TP) are shown in dark blue, binding residues incorrectly predicted as non-binding (FN) in light blue, and non-binding residues incorrectly predicted as binding (FP) in red. a Poor binding prediction: a prediction method (here BindPredict-CCS) might correctly identify only a small fraction of all binding residues (here in dark blue, with precision = recall = F1 = 11%). The method might even over-predict more residues as binding (red) and miss more observed binding residues (light blue) than it gets right. b FunFam filter with 1% prediction agreement: the prediction is filtered by requiring that at least 1% of all proteins aligned at a particular residue position have the same binding residue prediction (consensus threshold = 0.01). For the given example, this boosted recall to 67% (precision = 16%, F1 = 26%). c FunFam filter with 50% prediction agreement: requiring a consensus threshold of 0.5 (50% of the aligned residues predicted identically) removed most predicted binding residues without removing the correctly predicted ones (the correctly predicted residues shown in dark blue are identical in a and c; precision = 20%, recall = 11%, F1 = 14%).
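The F1 values quoted for the three panels can be checked from the stated precision and recall. The sketch below assumes F1 is the usual harmonic mean of precision and recall; the function name `f1_score` and the `panels` dictionary are illustrative and not taken from the article.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs quoted in the caption for panels a, b, and c.
panels = {"a": (0.11, 0.11), "b": (0.16, 0.67), "c": (0.20, 0.11)}

for name, (p, r) in panels.items():
    print(f"panel {name}: F1 = {f1_score(p, r):.0%}")
```

Under this assumption the rounded results agree with the caption: 11% for panel a, 26% for panel b, and 14% for panel c.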
Average binding residue similarity for FunFams and EC-numbersᵃ

| Group | Number of families | Number of proteins | Average binding residue similarity (Eq. 1) |
|---|---|---|---|
| Same FunFams | 1856 | 7172 | 36.9 ± 0.6 |
| Same EC-numbers | 1080 | 5789 | 29.9 ± 0.8 |
| EC-FunFams subset (FunFams) | 1103 | 4143 | 38.6 ± 0.8 |
| EC-FunFams subset (EC-numbers) | 833 | 4143 | 34.5 ± 0.9 |
| Same EC different FunFam | 771 | 2893 | 9.6 ± 0.4 |
| Same FunFam different EC | 404 | 2817 | 27.0 ± 1.0 |
| Same EC, same superfamily | 1006 | 4445 | 38.0 ± 0.01 |
| Same EC, different superfamily | 435 | 1155 | 5.22 ± 0.01 |

ᵃ Same FunFams: proteins within the same FunFam; Same EC-numbers: proteins with identical EC number; EC-FunFams subset: the same subset of proteins used for the similarity calculation both within FunFams and within EC classes; Same EC different FunFam: subset of proteins with identical EC number classified into different FunFams; Same FunFam different EC: subset of proteins from the same FunFam with different EC numbers; Same EC, same superfamily: proteins with identical EC number grouped into one structural superfamily; Same EC, different superfamily: proteins with identical EC number grouped into different superfamilies; ±: one standard error
Fig. 2 Cumulative binding residue similarities for FunFams and EC numbers. The x-axis gives the fraction of binding residue annotations (Eq. 1) agreeing between all pairs of proteins in the same functional "group" according to different sources: the thick black line marks the similarity within FunFams [3] and the thick gray line marks the similarity within the same EC number [2]. For comparison, the complements are also shown, namely the subsets of proteins in the same FunFam but with different EC numbers (dashed dark line) and in different FunFams but with the same EC number (dashed gray line). All curves give reverse cumulative counts, answering the question: how many protein families had a binding residue annotation similarity (Eq. 1) above the similarity threshold shown on the x-axis? The two panels show the absolute count of protein families (a) and the fraction of all families (b) on the y-axis. For instance, 60% or more of all binding residues (rightmost vertical gray line; the middle vertical gray line marks 50%) agreed within 354 FunFams (corresponding to 19%) and within 145 groups of identical EC number (corresponding to 14%). The leftmost vertical gray line marks random binding residue similarity (5.5 ± 0.2%). In contrast to all other groups, proteins grouped by the same EC number but differing FunFams (dashed gray line) have similarity scores close to random.
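The reverse cumulative curves described in this caption can be sketched as follows. This is a minimal illustration, assuming one average similarity value per family is already available as a fraction in [0, 1]. The function name `reverse_cumulative` and the toy values are hypothetical and are not data from the article.

```python
from typing import List, Sequence, Tuple


def reverse_cumulative(
    similarities: Sequence[float], thresholds: Sequence[float]
) -> List[Tuple[float, int, float]]:
    """For each threshold, return (threshold, count, fraction) of families
    whose similarity is at or above that threshold."""
    total = len(similarities)
    rows = []
    for threshold in thresholds:
        count = sum(1 for s in similarities if s >= threshold)
        fraction = count / total if total else 0.0
        rows.append((threshold, count, fraction))
    return rows


# Hypothetical per-family similarities, for illustration only.
toy_similarities = [0.10, 0.50, 0.60, 0.90]

for threshold, count, fraction in reverse_cumulative(toy_similarities, [0.5, 0.6]):
    print(f"similarity >= {threshold:.0%}: {count} families ({fraction:.0%})")
```

The count corresponds to the y-axis of panel a and the fraction to the y-axis of panel b.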
Fig. 3 Leveraging FunFams to better predict binding residues. The horizontal lines indicate the performance estimates for precision (Eq. 2) and recall (Eq. 3) of the BindPredict-CCS and BindPredict-CC baseline predictions, which do not use FunFams. Predictions are refined by constructing consensus predictions for the FunFams. The x-axes give different thresholds for the fraction of FunFam members that need to have a binding prediction for a particular residue in order to label that residue as binding in the consensus prediction: from at least one (0.01) to all (1.0). Depending on the threshold, precision and recall both increase significantly over the baseline prediction method. The two panels illustrate the improvement over two slightly different baseline prediction methods. Panel a shows BindPredict-CCS, which uses the cumulative couplings-based input features; in this case precision increases up to 61 ± 4%. Panel b shows the performance improvement for BindPredict-CC, which uses the clustering coefficient-based input features; for low thresholds, these predictions reach recall up to 50 ± 2%.
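The consensus step described in this caption can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: predictions are assumed to be already mapped onto the columns of a FunFam alignment, with `True` for predicted binding, `False` for predicted non-binding, and `None` for a gap, and the fraction is assumed to be computed over the proteins aligned at each column. The name `consensus_prediction` and the toy family are hypothetical.

```python
from typing import List, Optional, Sequence


def consensus_prediction(
    aligned_preds: Sequence[Sequence[Optional[bool]]], threshold: float
) -> List[bool]:
    """Label an alignment column as binding if the fraction of aligned
    family members predicted as binding is at least `threshold`."""
    if not aligned_preds:
        return []
    n_columns = len(aligned_preds[0])
    consensus = []
    for col in range(n_columns):
        # Only members with a residue (no gap) at this column get a vote.
        votes = [row[col] for row in aligned_preds if row[col] is not None]
        fraction = sum(votes) / len(votes) if votes else 0.0
        consensus.append(fraction >= threshold)
    return consensus


# Hypothetical family of four aligned members and four columns.
toy_family = [
    [True, False, True, None],
    [True, False, False, False],
    [False, False, True, None],
    [True, None, False, False],
]

for threshold in (0.01, 0.6, 1.0):
    print(threshold, consensus_prediction(toy_family, threshold))
```

A low threshold keeps any column with at least one binding prediction, which favors recall, while a high threshold keeps only columns on which most members agree, which favors precision.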