Vincenzo Laveglia, Andrea Giachetti, Davide Sala, Claudia Andreini, Antonio Rosato.
Abstract
Thirty-eight percent of protein structures in the Protein Data Bank contain at least one metal ion. However, not all these metal sites are biologically relevant. Cations present as impurities during sample preparation or in the crystallization buffer can cause the formation of protein-metal complexes that do not exist in vivo. We implemented a deep learning approach to build a classifier able to distinguish between physiological and adventitious zinc-binding sites in the 3D structures of metalloproteins. We trained the classifier using manually annotated sites extracted from the MetalPDB database. Using a 10-fold cross-validation procedure, the classifier achieved an accuracy of about 90%. The same neural classifier could predict the physiological relevance of non-heme mononuclear iron sites with an accuracy of nearly 80%, suggesting that the rules learned on zinc sites have general relevance. By quantifying the relative importance of the features describing the input zinc sites from the network perspective and by analyzing the characteristics of the MetalPDB datasets, we inferred some common principles. Physiological sites present a low solvent accessibility of the amino acids forming coordination bonds with the metal ion (the metal ligands), a relatively large number of residues in the metal environment (≥20), and a distinct pattern of conservation of Cys and His residues in the site. Adventitious sites, on the other hand, tend to have a low number of donor atoms from the polypeptide chain (often one or two). These observations support the evaluation of the physiological relevance of novel metal-binding sites in protein structures.
Year: 2022 PMID: 35679182 PMCID: PMC9241070 DOI: 10.1021/acs.jcim.2c00522
Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 6.162
Figure 1. Scheme of the classifier. The network is composed of three modules. The convolutional module processes the input data, and its output is then fed to the recurrent module; finally, the fully connected module generates the estimated class probabilities for the input site.
Confusion Matrix of the Performances for the Test Sets Averaged over the 10-Fold Cross-Validation Procedure
| | estimated physiological | estimated adventitious |
|---|---|---|
| real physiological | 1615 TP | 329 FN |
| real adventitious | 208 FP | 3144 TN |
Each row corresponds to the data points belonging to a certain class (“real” class, corresponding to physiological/adventitious zinc(II) sites in this work), whereas the columns show how the model classified the points (“estimated” class).
Performance Metrics Derived from the Results of Table
| metric | value | formula | meaning |
|---|---|---|---|
| PPV | 0.886 | TP/(TP + FP) | fraction of positive predictions that are correct |
| recall, TPR | 0.831 | TP/(TP + FN) | fraction of all positive sites that are correctly classified |
| NPV | 0.905 | TN/(TN + FN) | fraction of negative predictions that are correct |
| specificity, TNR | 0.938 | TN/(TN + FP) | fraction of all negative sites that are correctly classified |
| FDR | 0.114 | 1-PPV | fraction of positive predictions that are wrong |
| MCC | 0.780 | (TP·TN − FP·FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)) | Matthews’ correlation coefficient |
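
The values in the metrics table can be re-derived from the four counts of the zinc(II) confusion matrix. The following is a minimal sketch for checking the arithmetic; the function name `binary_metrics` and the dictionary keys are illustrative choices, not part of the original work, and the overall accuracy is added here for comparison with the "about 90%" quoted in the abstract.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Derive standard binary-classification metrics from confusion-matrix counts."""
    ppv = tp / (tp + fp)                 # fraction of positive predictions that are correct
    recall = tp / (tp + fn)              # fraction of positive sites correctly classified
    npv = tn / (tn + fn)                 # fraction of negative predictions that are correct
    specificity = tn / (tn + fp)         # fraction of negative sites correctly classified
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "PPV": ppv,
        "recall": recall,
        "NPV": npv,
        "specificity": specificity,
        "FDR": 1.0 - ppv,
        "MCC": mcc,
        "accuracy": accuracy,
    }

# Counts taken from the zinc(II) confusion matrix above.
zinc = binary_metrics(tp=1615, fn=329, fp=208, tn=3144)
for name, value in zinc.items():
    print(f"{name:12s} {value:.3f}")
```

Positive here means physiological and negative means adventitious, matching the TP/FN/FP/TN labels in the confusion matrix.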
Figure 2. (Top) Number of predictions for zinc(II) sites with a given confidence (absolute value of the difference between the score of the positive and of the negative classes). (Bottom) Error rate of the neural classifier in each confidence range. The data have been computed using 0.1 bins.
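
The confidence analysis described in this caption can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes each prediction comes as a pair of class scores in [0, 1], takes the confidence as their absolute difference, and groups predictions into bins of width 0.1. The function name and the toy scores are hypothetical.

```python
def error_rate_by_confidence(scores, labels, bin_width=0.1):
    """Count predictions and compute the error rate per confidence bin.

    scores: list of (positive_score, negative_score) pairs.
    labels: true classes, 1 = physiological, 0 = adventitious.
    """
    n_bins = round(1 / bin_width)
    counts = [0] * n_bins
    errors = [0] * n_bins
    for (p_pos, p_neg), label in zip(scores, labels):
        confidence = abs(p_pos - p_neg)
        predicted = 1 if p_pos > p_neg else 0
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(confidence / bin_width), n_bins - 1)
        counts[idx] += 1
        errors[idx] += int(predicted != label)
    # Bins with no predictions have an undefined error rate (None).
    rates = [e / c if c else None for e, c in zip(errors, counts)]
    return counts, rates

# Hypothetical toy scores, chosen only to exercise the function.
toy_scores = [(0.98, 0.02), (0.03, 0.97), (0.52, 0.48), (0.47, 0.53), (0.73, 0.27)]
toy_labels = [1, 0, 0, 0, 1]
counts, rates = error_rate_by_confidence(toy_scores, toy_labels)
print(counts)
print(rates)
```

In the toy data the only misclassified site sits in the lowest-confidence bin, which is the kind of pattern the bottom panel is designed to reveal.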
Figure 3. Importance of the input features. The plot shows the decrease in classification accuracy caused by the perturbation of the input features of the test sets, measured by the importance parameter (see Methods) averaged over the ten folds. The 20 amino acids were perturbed individually. Features describing the binding role of the residues and their secondary structure were merged.
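
The exact importance parameter is defined in the Methods, which are not reproduced here. As a stand-in, the sketch below uses plain permutation importance, a common perturbation scheme: each feature column is shuffled across sites and the resulting drop in accuracy is recorded. The classifier, the data, and all function names are hypothetical, and the averaging over ten folds is omitted.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, seed=0):
    """Return the accuracy drop observed when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        perturbed = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, perturbed, labels))
    return drops

def toy_predict(row):
    """Hypothetical classifier that looks only at feature 0."""
    return 1 if row[0] > 0.5 else 0

# Hypothetical toy data: feature 0 decides the class, feature 1 is pure noise.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [toy_predict(r) for r in rows]

drops = permutation_importance(toy_predict, rows, labels)
print(drops)
```

Shuffling the ignored feature leaves accuracy unchanged, while shuffling the decisive one lowers it, which is the contrast the importance plot relies on.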
Average Accuracy over the Test Sets of the 10-Fold Cross-Validation Procedure for Neural Networks Trained with a Subset of the 29 Features
| features | test accuracy (%) | standard deviation (%) |
|---|---|---|
| conservation of Cys, His, Asn, and Thr | 80.9 | 1.5 |
| binding role | 84.5 | 1.5 |
| solvent accessibility | 64.0 | 4.9 |
| conservation of C, H, N, T plus binding role | 86.6 | 1.6 |
| conservation of C, H, N, T plus solvent accessibility | 75.7 | 3.3 |
| binding role plus solvent accessibility | 84.2 | 1.9 |
Figure 4. Data visualization generated with the t-SNE dimensionality reduction algorithm. This algorithm produces a representation in an arbitrary 2D space of the distance between the points in the original multidimensional space of the data representation of the neural network. The points (red: physiological sites; blue: adventitious sites) are colored according to the (top) known class and (center) predicted class. In the bottom panel, the points are colored based on the average absolute solvent accessibility of the protein residues providing the donor atoms to the zinc(II) ion(s) regardless of their classification.
Confusion Matrix for the Classification of Non-Heme Mononuclear Iron Sites by the Zinc(II) Neural Classifier
| | estimated P | estimated N |
|---|---|---|
| real P | 246 TP | 67 FN |
| real N | 30 FP | 108 TN |
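
The "nearly 80%" accuracy quoted in the abstract for iron sites can be checked against this matrix. Because the two classes are unevenly represented (313 physiological vs 138 adventitious sites), the short sketch below also reports the per-class recall and their mean, often called balanced accuracy; that extra quantity is an addition for illustration and is not reported in the source.

```python
# Counts taken from the non-heme mononuclear iron confusion matrix above.
tp, fn, fp, tn = 246, 67, 30, 108

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall_physiological = tp / (tp + fn)   # true positive rate
recall_adventitious = tn / (tn + fp)    # true negative rate
balanced_accuracy = (recall_physiological + recall_adventitious) / 2

print(f"accuracy             {accuracy:.3f}")
print(f"recall (P)           {recall_physiological:.3f}")
print(f"recall (N)           {recall_adventitious:.3f}")
print(f"balanced accuracy    {balanced_accuracy:.3f}")
```

The two recalls come out nearly equal, so on this dataset the overall accuracy is not dominated by the majority class.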
Figure 5. Comparison of value ranges (adventitious vs physiological) for a selection of the features defined for all zinc-binding sites. (A) Number of amino acids binding the metal (“metal ligands”). (B) Average absolute solvent accessibility of the metal ligands. (C) Number of residues in the site. (D) Average absolute solvent accessibility of the residues in the second coordination sphere. Red empty boxes: adventitious sites; blue hatched boxes: physiological sites. Box plot setup: the box goes from the 25th to the 75th percentile (1st and 3rd quartile, respectively); whiskers are at the 5th and 95th percentile; the minimum and maximum values are shown by crosses; the square in the box corresponds to the mean value.