G Orlando, D Raimondi, W F Vranken.
Abstract
Next-generation sequencing is dramatically increasing the number of known protein sequences, while the number of experimentally determined protein structures lags behind. Structural bioinformatics attempts to close this gap by developing approaches that predict structure-level characteristics for uncharacterized protein sequences, and most of these methods rely heavily on evolutionary information collected from homologous sequences. Here we show that this approach carries a substantial observational selection bias: predictions are validated on proteins with known structures from the PDB, but for exactly those proteins significantly more homologs are available than for less-studied sequences randomly extracted from UniProt. Structural bioinformatics methods developed this way are thus likely to have overestimated performance; we demonstrate this for two contact prediction methods, whose performance drops by up to 60% when a more realistic amount of evolutionary information is taken into account. We provide a bias-free dataset, called NOUMENON, for the validation of contact prediction methods.
Year: 2016 PMID: 27857150 PMCID: PMC5114557 DOI: 10.1038/srep36679
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Overview of the analysis.
There is a significant difference in the number of homologs that can be retrieved for a protein with and without a solved structure. This can lead to an overestimation of the performances of methods that use this kind of information, as we show for contact prediction, where this effect is very strong.
Figure 2. (a) Distributions of the number of homologous sequences retrieved by jackhmmer (with 1 iteration and E-value = 0.0001) for the NOSTRUCT, STRUCT and PSICOV datasets. (b) Distributions of the NEFF scores calculated on the homologs retrieved by jackhmmer for the NOSTRUCT, STRUCT and PSICOV datasets. (c) Distributions of the average entropy for the alignments in the three datasets.
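The two alignment statistics named in this caption can be illustrated with a minimal sketch. The record does not state how NEFF or the average entropy were computed, so everything below is an assumption: NEFF is taken as the sum of per-sequence weights, each the inverse of the number of sequences within a sequence-identity threshold (the 0.8 value is hypothetical), and the average entropy is the mean per-column Shannon entropy in bits, with gaps treated as an ordinary symbol. The function names and the toy alignment are invented for illustration.

```python
import math
from collections import Counter


def average_column_entropy(alignment):
    """Mean Shannon entropy (in bits) over the columns of an alignment.

    Assumption: all sequences have equal length and gap characters
    are counted like any other symbol.
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    total = 0.0
    for j in range(n_cols):
        counts = Counter(seq[j] for seq in alignment)
        total += -sum((c / n_seqs) * math.log2(c / n_seqs)
                      for c in counts.values())
    return total / n_cols


def neff(alignment, identity_threshold=0.8):
    """Number of effective sequences under an assumed weighting scheme.

    Each sequence is weighted by 1 / (number of sequences, itself
    included, whose identity to it is >= identity_threshold).
    The threshold value is a hypothetical choice, not from the source.
    """
    def identity(a, b):
        return sum(x == y for x, y in zip(a, b)) / len(a)

    weights = []
    for a in alignment:
        neighbours = sum(identity(a, b) >= identity_threshold
                         for b in alignment)
        weights.append(1.0 / neighbours)
    return sum(weights)


# Toy alignment, made up for illustration only.
toy = ["ACDE", "ACDE", "ACDF", "GHIK"]
print(average_column_entropy(toy))
print(neff(toy))
```

Under this weighting, redundant sequences share a single unit of weight, so a deep alignment made of near-identical homologs yields a low NEFF even when the raw homolog count is high.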
Figure 3. (a) Correlation between NEFF and PSICOV performance on 150 proteins sampled from the STRUCT dataset (Pearson’s correlation coefficient is 0.83). (b) Correlation between the number of homologs (in thousands) and PSICOV performance on the same proteins (Pearson’s correlation coefficient is 0.70).
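For readers who want to reproduce this kind of analysis on their own data, Pearson's correlation coefficient can be computed as in the sketch below. The function name and the sample values are hypothetical and are not taken from the study.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up values standing in for per-protein NEFF scores and
# contact prediction performance; not data from the paper.
neff_scores = [1.0, 2.0, 3.0, 4.0]
performance = [0.10, 0.25, 0.30, 0.50]
print(pearson(neff_scores, performance))
```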
Figure 4. Plots showing the medians of the performances of CCMpred and PSICOV on the NOUMENON dataset (magenta) and the PSICOV dataset (green).
The shaded areas indicate, for each iteration, the data between the 40th and 60th percentiles and between the 25th and 75th percentiles.