Elin Teppa, Angela D Wilkins, Morten Nielsen, Cristina Marino Buslje.
Abstract
BACKGROUND: A large panel of methods aims to identify residues with a critical impact on protein function, based on evolutionary signals, sequence and structure information. However, it is not clear to what extent these methods overlap, or whether any of them has higher predictive potential than the others, in particular for identifying catalytic residues (CR) in proteins. Using a large set of enzymatic protein families and measures based on different evolutionary signals, we sought to separate the different components of the information content within a multiple sequence alignment and to investigate their predictive potential and degree of overlap.
Year: 2012 PMID: 22978315 PMCID: PMC3515339 DOI: 10.1186/1471-2105-13-235
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Different column patterns in an MSA. Schematic representation of an MSA and its phylogenetic tree (left). The conserved position is highlighted in red, coevolved positions in green and orange, and putative SDPs in yellow and blue. The column pattern is indicated at the top, and the method suited to detecting each kind of position at the bottom (C: conservation score; cMI: cumulative MI; ivET: integer value ET; rvET: real value ET; XDET and SDPfox are also indicated).
Figure 2. Heat map representation of the Spearman rank correlation coefficient between methods. cMI: cumulative MI; ivET: integer value evolutionary trace; rvET: real value evolutionary trace; cons: conservation. The number following a method's name (100, 62 or 50) indicates the redundancy of the sequences in the MSA (100, 62 and 50% redundancy reduced). The dendrogram indicates the distance between methods. The correlation colour key runs from white (0, no correlation) to blue (1, perfect correlation). All correlations are statistically different from zero (t-test, p-value threshold of 0.05).
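The Spearman rank correlation used to compare methods in Figure 2 can be illustrated with a minimal sketch. It ranks the per-residue scores of two methods (tied values share the mean rank) and then takes the Pearson correlation of the ranks. The score vectors below are hypothetical and are not taken from the study; the function names are illustrative only.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-residue scores from two methods on the same five positions.
conservation = [0.9, 0.1, 0.5, 0.7, 0.3]
cumulative_mi = [8.0, 1.0, 6.0, 5.0, 2.0]
rho = spearman(conservation, cumulative_mi)
print(f"Spearman rho = {rho:.3f}")
```

Computing this coefficient for every pair of methods would give the kind of matrix that a heat map such as Figure 2 displays.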
Performance and optimal distance threshold of the proximity measures for detecting catalytic residues
| Proximity measure | Performance | Optimal distance threshold |
| --- | --- | --- |
| p(SDPfox62) | 0.703 | 12 |
| p(XDET50) | 0.736 | 8 |
| p(ivET62) | 0.835 | 7 |
| p(ivET100) | 0.640 | 7 |
| p(rvET62) | 0.878 | 5 |
| p(rvET100) | 0.875 | 7 |
| p(MI62) | 0.823 | 7 |
| p(MI100) | 0.833 | 7 |
| p(C) | 0.854 | 5 |
The number of protein families included is 289 for SDPfox, 298 for XDET and 424 for all other methods. “p” before a method’s name denotes “proximity”. The number following a method’s name denotes the MSA data set on which the method was evaluated (i.e., 50 = MSA50). The optimal distance cut-off for the proximity sum was found using a grid search as described in Methods.
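The grid search for the distance cut-off can be pictured with the following sketch. It rests on assumptions that this record does not confirm: the proximity score of a residue is taken to be the sum of the scores of the other residues within the cut-off, distances are plain Euclidean distances between one coordinate per residue, and performance is the AUC. The exact definitions are those given in the Methods of the paper; all names and data below are hypothetical.

```python
from math import dist

def proximity(scores, coords, cutoff):
    """Assumed proximity sum: for each residue, add the scores of the OTHER
    residues whose coordinates lie within `cutoff` of it."""
    n = len(scores)
    return [
        sum(scores[j] for j in range(n)
            if j != i and dist(coords[i], coords[j]) <= cutoff)
        for i in range(n)
    ]

def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, is_cr in zip(scores, labels) if is_cr]
    neg = [s for s, is_cr in zip(scores, labels) if not is_cr]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def best_cutoff(scores, coords, labels, cutoffs):
    """Grid search: return the (cutoff, AUC) pair with the highest AUC.
    On ties, the first cut-off in `cutoffs` wins."""
    results = [(d, auc(proximity(scores, coords, d), labels)) for d in cutoffs]
    return max(results, key=lambda pair: pair[1])

# Hypothetical toy protein: five residues placed on a line, one catalytic residue.
coords = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (20, 0, 0), (40, 0, 0)]
scores = [0.2, 0.9, 0.8, 0.1, 0.3]   # e.g. a per-residue evolutionary score
labels = [1, 0, 0, 0, 0]             # 1 marks the catalytic residue
best = best_cutoff(scores, coords, labels, cutoffs=[2, 5, 8, 15])
print(best)
```

In this toy case the catalytic residue has a low score of its own but sits close to two high-scoring residues, so its proximity score only stands out once the cut-off is wide enough to include both neighbours.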
Performance of different methods in terms of the AUC
| Method | AUC | AUC0.1 |
| --- | --- | --- |
| C | 0.881 | 0.491 |
| 0.2 C + 0.8 p(C) | 0.898 | 0.553 |
| 0.15 C + 0.85 p(rvET62) | 0.913 | 0.567 |
| 0.25 C + 0.75 p(MI62) | 0.912 | 0.555 |
| 0.15 C + 0.0 p(C) + 0.85 p(rvET62) | 0.913 | 0.567 |
| 0.15 C + 0.3 p(C) + 0.55 p(MI62) | 0.916 | 0.571 |
| 0.15 C + 0.0 p(C) + 0.45 p(rvET62) + 0.4 p(MI62) | 0.921 | 0.586 |
Each row gives the optimal combined model for conservation and the different proximity scores. The relative weights were determined using fivefold cross-validation as described in the text. AUC and AUC0.1 are the average performance values over the 424 protein families.
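A combined model such as 0.15 C + 0.85 p(rvET62) is a weighted sum of per-residue scores. The sketch below scans the weight on a grid and keeps the value with the highest mean AUC over families. It is illustrative only: it assumes the two scores are already on comparable scales, it omits the fivefold cross-validation used in the study (the scan shown here would run on the training folds), and the families and scores are hypothetical.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, is_cr in zip(scores, labels) if is_cr]
    neg = [s for s, is_cr in zip(scores, labels) if not is_cr]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def combine(c, p, w):
    """Weighted sum w * C + (1 - w) * p for each residue."""
    return [w * ci + (1 - w) * pi for ci, pi in zip(c, p)]

def mean_auc(families, w):
    """Average AUC of the combined score over all families."""
    total = sum(auc(combine(f["C"], f["p"], w), f["labels"]) for f in families)
    return total / len(families)

def best_weight(families, step=0.05):
    """Scan w over 0, step, ..., 1 and return the first weight with the highest mean AUC."""
    grid = [round(k * step, 10) for k in range(int(round(1 / step)) + 1)]
    return max(grid, key=lambda w: mean_auc(families, w))

# Hypothetical families: conservation (C), one proximity score (p), catalytic labels.
families = [
    {"C": [0.8, 0.9, 0.1, 0.2], "p": [0.7, 0.1, 0.9, 0.2], "labels": [1, 0, 0, 0]},
    {"C": [0.1, 0.9, 0.2], "p": [0.2, 0.8, 0.1], "labels": [0, 1, 0]},
]
best = best_weight(families)
print(best, mean_auc(families, best), mean_auc(families, 1.0))
```

In the first toy family neither score alone ranks the catalytic residue first, while an intermediate weight does, which is the kind of complementarity the combined models in the table exploit.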