Thomas C Northey1, Anja Barešić1, Andrew C R Martin1.
Abstract
MOTIVATION: Protein-protein interactions are vital for protein function, with the average protein having between three and ten interacting partners. Knowledge of precise protein-protein interfaces comes from crystal structures deposited in the Protein Data Bank (PDB), but only 50% of structures in the PDB are complexes. There is therefore a need to predict protein-protein interfaces in silico, and various methods exist for this purpose. Here we explore the use of a predictor that is based on structural features and exploits random forest machine learning, comparing its performance with a number of popular established methods.
Year: 2018 PMID: 28968673 PMCID: PMC5860208 DOI: 10.1093/bioinformatics/btx585
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Residue geometry and solvent vectors. A candidate atom (red) is within the contact distance of a patch atom (purple). The residue geometry vectors (white) are used to calculate solvent vectors (black) and the angle between them is calculated. Because the angle is > 120°, the candidate atom is not included in the patch.
Fig. 2. An example interface site (bordered in yellow), an interface patch (cyan) and a rim patch (magenta). The fraction of the rim patch’s surface involved in the interface is not high enough for the patch to be labelled as interface. See Equation 3.
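The angle test described in the Fig. 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the two solvent vectors have already been computed and are passed in as 3D tuples, and the function names are hypothetical. The caption only states that an angle > 120° excludes the candidate, so treating exactly 120° as acceptable is an assumption.

```python
import math

# Threshold taken from the Fig. 1 caption; the handling of exactly 120 degrees is an assumption.
MAX_ANGLE_DEG = 120.0


def angle_between_deg(u, v):
    """Return the angle in degrees between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("a zero-length vector has no direction")
    # Clamp to [-1, 1] so floating-point error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


def passes_angle_test(patch_solvent_vec, candidate_solvent_vec,
                      max_angle_deg=MAX_ANGLE_DEG):
    """Hypothetical helper: True if the candidate atom may join the patch.

    The solvent vectors are assumed to be supplied by the caller; how they
    are derived from the residue geometry vectors is not shown here.
    """
    angle = angle_between_deg(patch_solvent_vec, candidate_solvent_vec)
    return angle <= max_angle_deg
```

Under these assumptions, two perpendicular solvent vectors pass the test, while two opposing vectors (180°) do not, so atoms on the far side of the protein would be kept out of the patch even when they fall within the contact distance.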
Summary of IntPred features
| Feature | Description | Type |
|---|---|---|
| prop | propensity score | Continuous numeric |
| hpho | hydrophobicity | Continuous numeric |
| homology | homology conservation score | Continuous numeric |
| FEP | FEP conservation score | Continuous numeric |
| SS | disulphide bonds | Continuous numeric |
| Hb | hydrogen bonds | Continuous numeric |
| helix (H) | helix secondary structure | Binary categorical |
| sheet (E) | sheet secondary structure | Binary categorical |
| mix (EH) | mixed secondary structure | Binary categorical |
| coil (C) | coil secondary structure | Binary categorical |
| pln | planarity | Continuous numeric |
| intf | Output class label | Binary categorical |
Note: See text for description of how these features are calculated.
Random forest performance
| Attributes | | | Performance | | | | | |
|---|---|---|---|---|---|---|---|---|
| Patch radius (Å) | CFEP | CHOM | ACC | PREC | SPEC | SENS | MCC | F |
| SR | | | 0.755 | 0.537 | | 0.194 | 0.208 | 0.285 |
| SR | | | 0.749 | 0.502 | 0.939 | 0.184 | 0.184 | 0.269 |
| SR | | | 0.737 | 0.453 | 0.913 | 0.213 | 0.170 | 0.290 |
| SR | | | 0.710 | 0.370 | 0.875 | 0.218 | 0.114 | 0.274 |
| 9 | | | 0.760 | 0.679 | 0.906 | 0.439 | 0.398 | 0.533 |
| 9 | | | 0.752 | 0.665 | 0.906 | 0.413 | 0.373 | 0.509 |
| 9 | | | 0.750 | 0.651 | 0.894 | 0.433 | 0.374 | 0.520 |
| 9 | | | 0.733 | 0.608 | 0.881 | 0.405 | 0.327 | 0.486 |
| 14 | | | | | 0.894 | | | |
| 14 | | | 0.780 | 0.725 | 0.888 | 0.573 | 0.492 | 0.640 |
| 14 | | | 0.780 | 0.718 | 0.882 | 0.582 | 0.492 | 0.643 |
| 14 | | | 0.764 | 0.691 | 0.871 | 0.555 | 0.453 | 0.616 |
Note: CFEP = conservation score calculated over functionally equivalent proteins from FOSTA; CHOM = conservation score calculated from homologues collected by a BLAST search of UniProtKB/SwissProt. Structural attributes were used in all instances. SR, single-residue patches; ACC, accuracy; PREC, precision; SPEC, specificity; SENS, sensitivity; MCC, Matthews’ correlation coefficient; F, F-measure. The highest score in every column is shown in bold. M (the number of randomly chosen attributes in every split) was set to 3 and T (the number of trees) was set to 100 in all cases, these having been found to provide the best performance (data not shown). All scores are averages over 10 folds of cross-validation.
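For readers who want to reproduce the six performance measures named in the note, the following is a minimal sketch using their standard textbook definitions on the counts of a binary confusion matrix. It assumes the paper uses these conventional definitions (with interface as the positive class); the function name and the choice to return 0.0 when a denominator is zero are illustrative assumptions.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute ACC, PREC, SPEC, SENS, MCC and F from confusion-matrix counts.

    Standard definitions are assumed; undefined ratios (zero denominator)
    are reported as 0.0, which is a convention chosen for this sketch.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    acc = ratio(tp + tn, tp + fp + tn + fn)
    prec = ratio(tp, tp + fp)    # fraction of predicted interface that is real
    spec = ratio(tn, tn + fp)    # fraction of non-interface correctly rejected
    sens = ratio(tp, tp + fn)    # fraction of real interface recovered (recall)
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    f = ratio(2 * prec * sens, prec + sens)  # harmonic mean of PREC and SENS
    return {"ACC": acc, "PREC": prec, "SPEC": spec,
            "SENS": sens, "MCC": mcc, "F": f}
```

Because interface residues are typically a minority of the surface, accuracy and specificity can be high even for a predictor that labels almost nothing as interface, which is why MCC and F are the more informative columns when comparing rows.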
Benchmarking of IntPred and other previously published general PPI methods using an independent test set
| Method | ACC | PREC | SPEC | SENS | MCC | F |
|---|---|---|---|---|---|---|
| ProMate | 0.780 | 0.401 | | 0.031 | 0.058 | 0.057 |
| PIER | 0.754 | 0.511 | 0.932 | 0.214 | 0.207 | 0.302 |
| SPPIDER | 0.759 | 0.472 | 0.783 | | | |
| PINUP | 0.772 | 0.459 | 0.927 | 0.220 | 0.199 | 0.298 |
| meta-PPISP | 0.755 | 0.499 | 0.902 | 0.300 | 0.245 | 0.375 |
| IntPred | | | 0.916 | 0.411 | 0.370 | 0.473 |
| IntPred (patch) | 0.771 | 0.803 | 0.922 | 0.522 | 0.500 | 0.633 |
Note: ACC, accuracy; PREC, precision; SPEC, specificity; SENS, sensitivity; MCC, Matthews’ correlation coefficient; F, F-measure. The highest score in every column is shown in bold. IntPred refers to the random forest model trained on all features and 14 Å-radius patches mapped to a residue-level prediction while IntPred (patch) refers to performance at the patch level.
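As a worked check on how the columns relate, the F-measure can be recomputed from the precision and sensitivity reported for PIER, assuming F is the usual harmonic mean of the two (the standard definition, which the source does not spell out here):

```latex
F = \frac{2 \cdot \mathrm{PREC} \cdot \mathrm{SENS}}{\mathrm{PREC} + \mathrm{SENS}}
  = \frac{2 \times 0.511 \times 0.214}{0.511 + 0.214}
  = \frac{0.218708}{0.725}
  \approx 0.302
```

This matches the tabulated value of 0.302. For some other rows the same calculation differs from the tabulated F in the third decimal place, which is consistent with the published values having been computed from unrounded precision and sensitivity.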
Comparison of the performance of methods (assessed by MCC) on obligate and transient complexes
| Method | MCC, obligate complexes | MCC, transient complexes |
|---|---|---|
| ProMate | 0.037 | 0.166 |
| PIER | 0.288 | 0.217 |
| SPPIDER | 0.426 | 0.311 |
| PINUP | 0.205 | 0.235 |
| meta-PPISP | 0.257 | 0.268 |
| IntPred | 0.381 | 0.303 |
Note: Overall performance is shown in Table 3.