Guillaume Postic, Nathalie Janel, Pierre Tufféry, Gautier Moroy.
Abstract
For three decades now, knowledge-based scoring functions that operate through the "potential of mean force" (PMF) approach have continuously proven useful for studying protein structures. Although these statistical potentials are not to be confused with their physics-based counterparts of the same name (i.e., PMFs obtained by molecular dynamics simulations), their particular success in assessing the native-like character of protein structure predictions has led authors to consider the computed scores as approximations of the free energy. However, this physical justification has been a matter of controversy since the beginning. Alternative interpretations based on Bayes' theorem have been proposed, but the misleading formalism that invokes the inverse Boltzmann law remains recurrent in the literature. In this article, we present a conceptually new method for ranking protein structure models by quality, which is (i) independent of any physics-based explanation and (ii) relevant to statistics and to a general definition of information gain. The theoretical development described in this study provides new insights into how statistical PMFs work, in comparison with our approach. To prove the concept, we have built interatomic distance-dependent scoring functions, based on the former and new equations, and compared their performance on an independent benchmark of 60,000 protein structures. The results demonstrate that our new formalism outperforms statistical PMFs in evaluating the quality of protein structural decoys. Therefore, this original type of score offers a possibility to improve the success of statistical PMFs in the various fields of structural biology where they are applied. The open-source code is available for download at https://gitlab.rpbs.univ-paris-diderot.fr/src/ig-score.
Keywords: Knowledge-based scoring functions; Model quality assessment; Protein structure prediction; Statistical potentials
Year: 2020 PMID: 32837711 PMCID: PMC7431362 DOI: 10.1016/j.csbj.2020.08.013
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Accuracy in ranking pairs of decoy structures from 3DRobot. A 50.0% value would correspond to a random ranking. The “near-native”, “good”, “medium”, and “poor” model qualities correspond to score (TM-score or GDT_TS) intervals [1.0, 0.8[, [0.8, 0.6[, [0.6, 0.4[, and [0.4, 0.0], respectively.
Values are accuracy (%).

| Model quality | Score | PMF | MCK1 | MCK2 | DOPE | TIG | GOAP |
|---|---|---|---|---|---|---|---|
| Near-native | TM-score | 67.8 | 69.5 | 70.3 | 67.0 | 71.6 | 91.5 |
| Good | TM-score | 70.6 | 73.3 | 74.2 | 71.4 | 75.1 | 86.8 |
| Medium | TM-score | 71.2 | 72.9 | 73.6 | 75.6 | 75.2 | 80.8 |
| Poor | TM-score | 68.4 | 69.2 | 70.1 | 71.4 | 70.6 | 76.2 |
| Near-native | GDT_TS | 63.5 | 64.8 | 65.7 | 61.8 | 67.0 | 88.0 |
| Good | GDT_TS | 68.3 | 69.7 | 70.6 | 67.9 | 70.8 | 85.3 |
| Medium | GDT_TS | 73.5 | 75.3 | 75.8 | 76.5 | 77.0 | 86.5 |
| Poor | GDT_TS | 66.8 | 68.3 | 68.8 | 71.1 | 70.2 | 76.0 |
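
The accuracy reported above is a pairwise ranking measure: a pair of decoys counts as correct when the scoring function orders the two models the same way as their true quality (TM-score or GDT_TS), so a random scorer is expected to reach 50%. The following is a minimal sketch of that idea, not the authors' implementation. The function name `pairwise_ranking_accuracy`, the `lower_is_better` flag, and the handling of ties are assumptions made for illustration; how the study selects pairs within each quality category is not described in this record.

```python
from itertools import combinations


def pairwise_ranking_accuracy(true_quality, predicted_score, lower_is_better=True):
    """Return the percentage of decoy pairs ranked in the correct order.

    true_quality    : true quality per decoy (e.g. TM-score), higher is better.
    predicted_score : score per decoy from the scoring function under test.
    lower_is_better : assumption; True for energy-like scores where a lower
                      value means a better model.
    """
    if len(true_quality) != len(predicted_score):
        raise ValueError("inputs must have the same length")

    correct = 0
    total = 0
    for i, j in combinations(range(len(true_quality)), 2):
        dq = true_quality[i] - true_quality[j]
        if dq == 0:
            continue  # no defined "better" model, so the pair is skipped
        ds = predicted_score[i] - predicted_score[j]
        if lower_is_better:
            ds = -ds  # flip so that a larger value always means "better"
        total += 1
        # A tie in predicted score (ds == 0) is counted as an incorrect pair.
        if ds * dq > 0:
            correct += 1

    if total == 0:
        raise ValueError("no pair with distinct true quality")
    return 100.0 * correct / total


# Toy example with four hypothetical decoys (values are made up).
tm_scores = [0.9, 0.7, 0.5, 0.3]
energies = [-10.0, -8.0, -9.0, -2.0]
accuracy = pairwise_ranking_accuracy(tm_scores, energies)
```

In the toy example, five of the six pairs are ordered correctly; only the second and third decoys are swapped by the score.
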
Fig. 1. Examples of protein models correctly and incorrectly ranked with the information-gain-based approach, TIG. For each example, the better and worse models are represented in blue and red, respectively. (A) Predicted structures of the CASP13 target T1006 (magnetosome protein MamM) correctly ranked by TIG, but incorrectly ranked by the PMF, mock, and DOPE scoring functions. (B) Decoy structures of the ATP-binding subunit ClpC1 of the Clp protease (PDB code 3wdeA) from the 3DRobot dataset, which are correctly ranked by all methods except TIG. (C) Predicted structures of the target T0971 (terfestatin biosynthesis enzyme TerC), for which only TIG fails. (D) Decoy structures of the DUB domain of the human zinc metalloprotease AMSH-LP (PDB code 2znrA), for which only TIG succeeds. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Ranks predicted by the TIG, DOPE, mock, and PMF scores, averaged for three categories of models. For the “near-native” and “good” models, the lower the value (i.e., the higher the rank), the better the performance. Conversely, for models of “poor” quality, the higher the value (i.e., the lower the rank), the better.
Values are average predicted ranks.

| Model quality | Score | PMF | MCK1 | MCK2 | DOPE | TIG | GOAP |
|---|---|---|---|---|---|---|---|
| Near-native | TM-score | 68.1 | 64.8 | 63.9 | 67.1 | 61.9 | 46.3 |
| Good | TM-score | 123.4 | 122.3 | 122.2 | 120.5 | 121.9 | 123.6 |
| Near-native | GDT_TS | 63.1 | 59.5 | 58.4 | 61.5 | 55.9 | 38.4 |
| Good | GDT_TS | 102.5 | 100.6 | 100.2 | 101.0 | 100.0 | 98.1 |
| Poor | TM-score | 222.9 | 226.2 | 227.0 | 229.1 | 229.3 | 237.3 |
| Poor | GDT_TS | 222.0 | 224.9 | 225.5 | 226.3 | 227.3 | 234.8 |
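
To make the average predicted rank concrete, the sketch below ranks a set of decoys by predicted score (rank 1 is the model predicted to be best) and averages the ranks within each quality category. It is an illustration under stated assumptions, not the authors' code: the function names, the `lower_is_better` flag, and the choice to rank within a single decoy set are hypothetical. The category boundaries follow the intervals given in the accuracy table caption ([1.0, 0.8[, [0.8, 0.6[, [0.6, 0.4[, and [0.4, 0.0]).

```python
def quality_category(q):
    """Map a true quality value (TM-score or GDT_TS on a 0-1 scale) to a category."""
    if q > 0.8:
        return "near-native"  # interval [1.0, 0.8[
    if q > 0.6:
        return "good"         # interval [0.8, 0.6[
    if q > 0.4:
        return "medium"       # interval [0.6, 0.4[
    return "poor"             # interval [0.4, 0.0]


def average_rank_by_category(true_quality, predicted_score, lower_is_better=True):
    """Average predicted rank (1 = predicted best) per true-quality category."""
    if len(true_quality) != len(predicted_score):
        raise ValueError("inputs must have the same length")

    # Sort decoy indices from predicted best to predicted worst.
    order = sorted(
        range(len(predicted_score)),
        key=lambda i: predicted_score[i],
        reverse=not lower_is_better,
    )
    rank = {idx: pos + 1 for pos, idx in enumerate(order)}

    sums, counts = {}, {}
    for idx, q in enumerate(true_quality):
        cat = quality_category(q)
        sums[cat] = sums.get(cat, 0) + rank[idx]
        counts[cat] = counts.get(cat, 0) + 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


# Toy example with four hypothetical decoys (values are made up).
tm_scores = [0.9, 0.85, 0.7, 0.3]
energies = [-5.0, -3.0, -4.0, -1.0]
avg_ranks = average_rank_by_category(tm_scores, energies)
```

A good scoring function pushes near-native models toward small average ranks and poor models toward large ones, which matches the reading of the table given in its caption.
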
Fig. 2. Predicted quality (TIG score) of decoy structures from 3DRobot plotted against their true quality (TM-score). The Pearson correlation coefficient r is given for each example. (A) Conserved domain of nonstructural protein 3 (nsP3) from SARS coronavirus (PDB code 2acfA; 182 residues). (B) Dihydroneopterin aldolase from Escherichia coli (PDB code 2o90A; 122 residues). (C) Catalytic domain of the DNA glycosylase MutY (PDB code 1munA; 225 residues). (D) Protoglobin from Methanosarcina acetivorans (PDB code 3qzxA; 195 residues).
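
The Pearson correlation coefficient r quoted for each panel measures the linear association between predicted and true quality across the decoys of one target. A minimal, dependency-free sketch of that computation is given below; the function name `pearson_r` and the toy values are illustrative assumptions, and the sign expected for a given score depends on whether lower or higher values indicate better models, which this record does not state.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of centered cross-products and centered squares.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)

    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Toy example: hypothetical predicted scores against true TM-scores.
true_tm = [0.2, 0.4, 0.6, 0.8]
predicted = [1.0, 2.0, 3.0, 4.0]
r = pearson_r(predicted, true_tm)
```
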
Fig. 3. Score profiles from the TIG (blue) and PMF (green) methods. The interacting atoms are the Cα of the (A) Cys-Cys, (B) Asp-Glu, (C) Val-Val, and (D) Lys-Arg residue pairs. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)