Sony Malhotra, Agnel Praveen Joseph, Jeyan Thiyagalingam, Maya Topf.
Abstract
Structures of macromolecular assemblies derived from cryo-EM maps often contain errors that become more abundant with decreasing resolution. Despite efforts in the cryo-EM community to develop metrics for map and atomistic model validation, no specific scoring metrics have so far been applied systematically to assess the interfaces between assembly subunits. Here, we comprehensively assessed protein–protein interfaces in macromolecular assemblies derived by cryo-EM. To this end, we developed the Protein Interface-score (PI-score), a density-independent machine learning-based metric trained using the features of protein–protein interfaces in crystal structures. We evaluated 5873 interfaces in 1053 PDB-deposited cryo-EM models (including SARS-CoV-2 complexes), as well as the models submitted to CASP13 cryo-EM targets and the EM model challenge. We further inspected the interfaces associated with low scores and found that some of those, especially in intermediate-to-low resolution (worse than 4 Å) structures, were not captured by density-based assessment scores. A combined score incorporating PI-score and a fit-to-density score showed discriminatory power, making our method a complementary assessment tool for the ever-increasing number of complexes solved by cryo-EM.
Year: 2021 PMID: 34099703 PMCID: PMC8184972 DOI: 10.1038/s41467-021-23692-x
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Workflow for developing a protein–protein interface-based score (PI-score) to assess macromolecular assemblies derived using cryo-EM.
High-resolution complexes (with ≥ two chains) were obtained from the PDB and are referred to as the ‘positive dataset 1’ (PD1). Protein–protein docking was used to derive complexes structurally close to PD1, which form the ‘positive dataset 2’ (PD2). The complexes obtained upon docking that have a higher interface RMSD (iRMSD) and a lower fraction of aligned native residues (fNal) at the interface form the ‘negative dataset’ (ND). Interface features are calculated on all the complexes and are used as input to train a supervised machine-learning classifier, which is then used to predict the class labels of the benchmark dataset.
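The PD2/ND split described in this caption can be sketched as a simple labelling rule on iRMSD and fNal. This is only an illustration: the caption does not state the cut-off values, so `IRMSD_CUTOFF` and `FNAL_CUTOFF` below are hypothetical placeholders, and the `DockedPose` type and the choice to leave intermediate poses unlabelled are assumptions as well.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cut-offs: the caption does not give the values actually used.
IRMSD_CUTOFF = 4.0  # Å (assumed)
FNAL_CUTOFF = 0.5   # fraction of aligned native interface residues (assumed)


@dataclass
class DockedPose:
    pose_id: str
    irmsd: float  # interface RMSD to the native complex, in Å
    fnal: float   # fraction of aligned native residues at the interface


def label_pose(pose: DockedPose) -> Optional[str]:
    """Assign a docked pose to PD2 (near-native) or ND (non-native).

    Poses that satisfy neither rule are left unlabelled (None); how such
    intermediate poses are treated is an assumption of this sketch.
    """
    if pose.irmsd <= IRMSD_CUTOFF and pose.fnal >= FNAL_CUTOFF:
        return "PD2"
    if pose.irmsd > IRMSD_CUTOFF and pose.fnal < FNAL_CUTOFF:
        return "ND"
    return None


poses = [
    DockedPose("pose_1", irmsd=1.8, fnal=0.85),
    DockedPose("pose_2", irmsd=9.4, fnal=0.10),
    DockedPose("pose_3", irmsd=3.0, fnal=0.30),
]
labels = {p.pose_id: label_pose(p) for p in poses}
```

With these placeholder cut-offs, `pose_1` would fall into PD2, `pose_2` into ND, and `pose_3` would stay unlabelled.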
Fig. 2 Machine learning-based classifier to assess the quality of protein–protein interfaces.
a Importance of interface features in distinguishing the ‘native-like’ interface. The ranks calculated using different methods (Ridge, Random Forest (RF), Recursive feature elimination (RFE), Linear regression (Linear reg) and Lasso) were normalised between 0 and 1, and the mean feature rank is plotted in black. b, c Performance of different classifiers on the training dataset: RF (random forest), SVM (support vector machine), NN (neural networks) and GB (gradient boost) were used to perform supervised learning on the training dataset, with a stratified shuffle split (ten splits) as the means of cross-validation. Performance is evaluated using accuracy, precision, F1 and recall scores and the Matthews correlation coefficient. Performance measures of Model A (b): trained on the docking-derived positive dataset (PD2) and the negative dataset (ND). Performance measures of Model B (c): trained using both the high-resolution and docking-derived positive datasets (PD1 + PD2) and the negative dataset (ND). d Fraction of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) at different PI-score thresholds. The fractions (Y-axis) are averaged over the ten splits (stratified shuffle split) of the data. The PI-score thresholds (X-axis) are given as absolute values.
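The performance measures named in panels b and c all follow from the four confusion-matrix counts of a single split. The sketch below uses the standard definitions; the function name and the example counts are illustrative and are not taken from the paper.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Made-up counts for one cross-validation split (illustration only).
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

In a setting like the one described, these values would be computed per split and then averaged over the ten splits.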
Performance in different bins of the scores using the SVM machine learning-based classifier.
| Score bin (−/+) | TPR (true positive rate) | FPR (false positive rate) | Precision | Specificity |
|---|---|---|---|---|
| (−/+) [0.0 to 0.5] | 0.15 | 0.76 | 0.15 | 0.24 |
| (−/+) [0.5 to 1.0] | 0.29 | 0.58 | 0.29 | 0.41 |
| (−/+) [1.0 to 1.5] | 0.61 | 0.39 | 0.63 | 0.61 |
| (−/+) [1.5 to 2.0] | 0.78 | 0.21 | 0.78 | 0.78 |
| (−/+) [2.0 to 2.5] | 0.93 | 0.11 | 0.93 | 0.99 |
| ≥ 2.5 and ≤ −2.5 | 1.0 | 0.18 | 0.97 | 0.81 |
The bins are defined by the thresholds listed below. TPR, FPR, precision and specificity are values averaged over the test datasets obtained from ten-fold cross-validation:
0.0–0.5: True positives are in the score range 0.0 to 0.5 and true negatives in the score range −0.5 to 0.0. False positives are complexes from the negative dataset (ND) that are predicted positive with a PI-score between 0.0 and 0.5, and false negatives are positive complexes from either positive dataset 1 or 2 (PD1 or PD2) that are predicted negative with a PI-score between −0.5 and 0.0.
0.5–1.0: True positives are in the score range 0.5 to 1.0 and true negatives in the score range −1.0 to −0.5. False positives are complexes from the negative dataset (ND) that are predicted positive with a PI-score between 0.5 and 1.0, and false negatives are positive complexes from either positive dataset 1 or 2 (PD1 or PD2) that are predicted negative with a PI-score between −1.0 and −0.5.
1.0–1.5: True positives are in the score range 1.0 to 1.5 and true negatives in the score range −1.5 to −1.0. False positives are complexes from the negative dataset (ND) that are predicted positive with a PI-score between 1.0 and 1.5, and false negatives are positive complexes from either positive dataset 1 or 2 (PD1 or PD2) that are predicted negative with a PI-score between −1.5 and −1.0.
1.5–2.0: True positives are in the score range 1.5 to 2.0 and true negatives in the score range −2.0 to −1.5. False positives are complexes from the negative dataset (ND) that are predicted positive with a PI-score between 1.5 and 2.0, and false negatives are positive complexes from either positive dataset 1 or 2 (PD1 or PD2) that are predicted negative with a PI-score between −2.0 and −1.5.
2.0–2.5: True positives are in the score range 2.0 to 2.5 and true negatives in the score range −2.5 to −2.0. False positives are complexes from the negative dataset (ND) that are predicted positive with a PI-score between 2.0 and 2.5, and false negatives are positive complexes from either positive dataset 1 or 2 (PD1 or PD2) that are predicted negative with a PI-score between −2.5 and −2.0.
≥ 2.5 and ≤ −2.5: True positives have a score ≥ 2.5 and true negatives a score ≤ −2.5. False positives are complexes from the negative dataset (ND) that are predicted positive with a PI-score ≥ 2.5, and false negatives are positive complexes from either positive dataset 1 or 2 (PD1 or PD2) that are predicted negative with a PI-score ≤ −2.5.
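The binning scheme in these notes can be sketched as follows: each scored complex goes into a bin according to the absolute value of its PI-score, the sign of the score gives the predicted class, and the four rates are computed within each bin. The half-open bin boundaries, the treatment of a score of exactly 0 as negative, and all names and example data are assumptions made for illustration.

```python
# Bin edges on |PI-score|; the last bin is open-ended (|score| >= 2.5).
BIN_EDGES = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, float("inf")]


def bin_index(score: float) -> int:
    """Index of the half-open bin [lo, hi) that contains |score| (assumed)."""
    magnitude = abs(score)
    for i in range(len(BIN_EDGES) - 1):
        if BIN_EDGES[i] <= magnitude < BIN_EDGES[i + 1]:
            return i
    raise ValueError(f"score {score!r} does not fall into any bin")


def ratio(numerator: int, denominator: int):
    """Safe division: returns None when the denominator is zero."""
    return numerator / denominator if denominator else None


def per_bin_rates(scored):
    """scored: iterable of (pi_score, is_truly_positive) pairs."""
    counts = [
        {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
        for _ in range(len(BIN_EDGES) - 1)
    ]
    for score, is_positive in scored:
        predicted_positive = score > 0  # sign of the score gives the class
        if predicted_positive:
            key = "TP" if is_positive else "FP"
        else:
            key = "FN" if is_positive else "TN"
        counts[bin_index(score)][key] += 1
    return [
        {
            "TPR": ratio(c["TP"], c["TP"] + c["FN"]),
            "FPR": ratio(c["FP"], c["FP"] + c["TN"]),
            "precision": ratio(c["TP"], c["TP"] + c["FP"]),
            "specificity": ratio(c["TN"], c["TN"] + c["FP"]),
        }
        for c in counts
    ]


# Made-up scores and labels (illustration only).
example = [
    (0.3, True), (0.2, False), (-0.4, False), (-0.1, True),
    (2.7, True), (-3.0, False),
]
rates = per_bin_rates(example)
```

Here `rates[0]` describes the 0.0–0.5 bin and `rates[5]` the outermost bin; bins with no members report `None` for every rate.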
Fig. 3 Scoring the interfaces in the oligomeric CASP13 target T1020o.
The target structure is shown in gold in all panels and the model structures being assessed are shown in red, green and blue. The chains are labelled accordingly. a Target structure within the cryo-EM map. The interface residues from the three chains are shown as grey spheres. b Model TS004_2o, with a positive PI-score for all three interfaces in the trimeric assembly. c–e Models TS008_4o, TS135_3o and TS208_1o, respectively, for which the interfaces are scored negatively by PI-score.
Assessment of interfaces in the models of CASP13 cryo-EM target T1020o.
| Model ID | Model interface | Target interface | iRMSD (Å), fNal | Predicted class | Score |
|---|---|---|---|---|---|
| TS004_2o | AB | AB | 2.2, 0.81 | Positive | 2.6 |
| TS004_2o | BC | BC | 2.5, 0.75 | Positive | 2.6 |
| TS004_2o | AC | AC | 2.1, 0.82 | Positive | 2.7 |
| TS008_4o | AB | AC | 3.16, 0.42 | Negative | −1.5 |
| TS008_4o | BC | BC | 2.82, 0.48 | Negative | −1.5 |
| TS008_4o | AC | AB | 2.81, 0.48 | Negative | −1.5 |
| TS135_3o | AB | BC | 3.08, 0.56 | Negative | −1.6 |
| TS135_3o | BC | AB | 3.4, 0.52 | Negative | −1.6 |
| TS135_3o | AC | AC | 3.51, 0.6 | Negative | −1.6 |
| TS208_1o | AB | BC | 2.6, 0.63 | Not ranked (Interface residues from model 9 and 8) | NA |
| TS208_1o | BC | AC | 2.5, 0.54 | Negative | −0.2 |
| TS208_1o | AC | AB | 2.6, 0.52 | Negative | −0.39 |
The model and equivalent target chains forming the interface are listed along with the interface RMSD (iRMSD), fraction of aligned native interface residues (fNal) and predicted class using our model.
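As a rough illustration of the fNal column, the sketch below computes the fraction of native (target) interface residues that are also found at the model interface once model chains are mapped onto their equivalent target chains. This set-overlap definition is an assumption made for illustration; the precise alignment procedure used for fNal is not described here, and the function name, residue identifiers and chain mapping are hypothetical.

```python
def fraction_native_interface(native_residues, model_residues, chain_map):
    """Fraction of native interface residues recovered at the model interface.

    native_residues, model_residues: sets of (chain_id, residue_number).
    chain_map: model chain ID -> equivalent target chain ID.
    Assumed definition: |native ∩ mapped model| / |native|.
    """
    if not native_residues:
        raise ValueError("native interface has no residues")
    mapped = {
        (chain_map[chain], number)
        for chain, number in model_residues
        if chain in chain_map
    }
    return len(native_residues & mapped) / len(native_residues)


# Hypothetical example: model interface AB corresponds to target interface AC.
native = {("A", 10), ("A", 11), ("C", 45), ("C", 46)}
model = {("A", 10), ("A", 11), ("B", 45), ("B", 50)}
fnal = fraction_native_interface(native, model, {"A": "A", "B": "C"})
```

In this toy case three of the four native interface residues are recovered, giving a value of 0.75. The chain mapping matters because, as the table shows, a model interface such as AB may correspond to a different target interface such as AC.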
Fig. 4 Scoring the interfaces in the target T0002 from the 2016 EM model challenge.
a Scoring the interfaces in the alpha-subunit ring of the model structure EM164_1. The chains (F and C) that form a negatively scoring interface are shown in red and green, and the target structure is shown in gold. The surfaces of the interface-forming chains (F and C) are shown as spheres, and the loose packing at the interface is marked with black ovals. b Scoring the interfaces in the beta-subunit ring of the 20S proteasome, one of the targets in the EM model challenge (T0002). The target structure is shown in gold, and the chains forming the interface (n and d) in the model structure being assessed (EM164_1) are shown in blue and purple. The surfaces (mesh) of the interface-forming chains are shown to highlight the clashes at the interface formed by chains n and d in the model.
Fig. 5 Application to the fitted models in EMDB at intermediate-to-low resolution.
The chains from the crystal structure are in gold and the chains from the modelled structure are in red and green. The interface residues are shown as grey spheres. The plot of the local density-based score (SMOC) is shown for the chains forming an interface in the model and the equivalent chain in the crystal structure. The X-axis is numbered as per residue numbers in the crystal structure. The average SMOC over the model chain is shown as a blue dashed line. a 5 Å resolution structure of Chikungunya virus and the subcomplex envelope1–envelope2 heterodimer (E1–E2) (EMD-5577; fitted PDB: 3J2W). The corresponding 2.35 Å resolution crystal structure is PDB: 3N44 (gold). b 9.8 Å resolution structure of the TFIID subunit 5 and 9 sub-complex (EMD-9302; fitted PDB: 6MZD, cyan and green). The corresponding 2.5 Å resolution crystal structure for subunit 5–subunit 9 is PDB: 6F3T (gold). PI-score and CS (the weighted combined score, see ‘Methods’) are listed.