Tristan T Aumentado-Armstrong, Bogdan Istrate, Robert A Murgita.
Abstract
Interaction sites on protein surfaces mediate virtually all biological activities, and their identification holds promise for disease treatment and drug design. Novel algorithmic approaches for the prediction of these sites have been produced at a rapid rate, and the field has seen significant advancement over the past decade. However, the most current methods have not yet been reviewed in a systematic and comprehensive fashion. Herein, we describe the intricacies of the biological theory, datasets, and features required for modern protein-protein interaction site (PPIS) prediction, and present an integrative analysis of the state-of-the-art algorithms and their performance. First, the major sources of data used by predictors are reviewed, including training sets, evaluation sets, and methods for their procurement. Then, the features employed and their importance in the biological characterization of PPISs are explored. This is followed by a discussion of the methodologies adopted in contemporary prediction programs, as well as their relative performance on the datasets most recently used for evaluation. In addition, the potential utility that PPIS identification holds for rational drug design, hotspot prediction, and computational molecular docking is described. Finally, an analysis of the most promising areas for future development of the field is presented.
Keywords: Biological databases; Feature selection; Homology; Interface types; Machine learning; Prediction algorithm; Protein structure; Protein-protein binding; Protein-protein interaction; Protein-protein interface
Year: 2015 PMID: 25713596 PMCID: PMC4338852 DOI: 10.1186/s13015-015-0033-9
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Filters used to curate protein datasets for use in training PPIS predictors, including the reasoning behind their use, the methods and specific software used to implement them, and references detailing the predictors that make use of them
| Filter | Reasoning | Method | Software | Predictors |
|---|---|---|---|---|
| Exclusion of non-biological complexes | Avoid training on complexes not present | Check against other database | PQS [ | [ |
| Resolution | Low resolution structures may be inaccurate | PDB filtering | In-House | [ |
| Canonical AAs | Most programs cannot handle non-canonical amino acids | [ | ||
| Redundancy | Reduce overfitting | Sequence similarity cutoff | BLAST [ | [ |
| | | Removal of members of same superfamily | SCOP [ | [ |
| | | Similarity clustering with representative structure | In-House | [ |
| Specialized databases | Pre-filtered databases are more reliable | Use of database | ProtInDB [ | [ |
| Chain Length | Ensure removal of fragments and peptides | PDB filtering; UniPROT [ | In-House | [ |
| Only X-ray Crystal Structures | NMR are harder to validate, less precise, and more difficult to process [ | PDB filtering | In-House | [ |
| No antibody-antigen interactions | Ag-Ab complexes bind on different principles than PPIs [ | [ |
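Several of the filters in the table above can be expressed as simple predicates over chain records. The following is a minimal sketch, not the procedure of any particular predictor: the `Chain` record, the thresholds, and the `similarity` function are all hypothetical, and `difflib.SequenceMatcher` is used only as a stand-in for a BLAST-derived sequence identity.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# The 20 canonical amino acids in one-letter code.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


@dataclass
class Chain:
    """Hypothetical record for one protein chain (illustration only)."""
    chain_id: str
    sequence: str
    method: str        # experimental method, e.g. "X-RAY" or "NMR"
    resolution: float  # resolution in angstroms


def passes_basic_filters(chain, max_resolution=3.0, min_length=40):
    """X-ray only, resolution cutoff, chain length, canonical residues.

    The default thresholds are assumed values, not taken from the table.
    """
    return (
        chain.method == "X-RAY"
        and chain.resolution <= max_resolution
        and len(chain.sequence) >= min_length
        and set(chain.sequence) <= CANONICAL
    )


def similarity(seq_a, seq_b):
    """Stand-in similarity in [0, 1]; real pipelines would use BLAST."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def remove_redundancy(chains, cutoff=0.3):
    """Greedily keep a chain only if it is below the cutoff to all kept chains."""
    kept = []
    for chain in chains:
        if all(similarity(chain.sequence, k.sequence) < cutoff for k in kept):
            kept.append(chain)
    return kept


def curate(chains, max_resolution=3.0, min_length=40, cutoff=0.3):
    """Apply the basic filters, then the redundancy filter."""
    filtered = [
        c for c in chains
        if passes_basic_filters(c, max_resolution, min_length)
    ]
    return remove_redundancy(filtered, cutoff)
```

The greedy pass keeps the first chain encountered as the representative of each group of similar chains, so the result depends on input order; sorting by resolution beforehand is one way to prefer better structures.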
Datasets used to evaluate predictors in Table 4, including the source from which they were derived, as well as the publication in which they were created using the requirements in the “Description” column
| Label | Dataset | Source | Source ref. | Description | Publication | Year |
|---|---|---|---|---|---|---|
| A | DB3-188 | DB 3.0 | [ | | [ | 2010 |
| B | DS56B | CAPRI | [ | Targets 1-27 Bound | [ | 2010 |
| C | DS56U | CAPRI | [ | Targets 1-27 Unbound | [ | 2010 |
| D | NI1 | PDB | [ | | [ | 2014 |
| E | NI2 | PDB | [ | | [ | 2014 |
| F | PlaneDimers | Mintz et al. | [ | Planar PPI; 20 Excl. MBPs, | [ | 2011 |
| ∗ | Dimers | Mintz et al. | [ | Clustered on seq. similarity; Excl. MBPs, | [ | 2011 |
| G | TransComp_1 | DB 4.0 | [ | “Simple” (low conf. change); Non-obligate | [ | 2011 |
| ∗ | TransComp_2 | CAPRI | [ | Not in TransComp_1; Non-obligate | [ | 2011 |
| H | W025 | DB 1.0/2.0 | [ | Excl. | [ | 2006 |
| I | S435 | PDB | [ | PQS filtered; Excl. NA, MBPs, VS, NMR | [ | 2007 |
| J | S149 | PDB | [ | PQS filtered; NA, MBPs, VS, NMR; | [ | 2007 |
| ∗ | S21a | S149 | [ | Nonredundant; MC | [ | 2007 |
| K | S58 | PDB | [ | Excl. NA, ligands; | [ | 2012 |
| L | 3DS | 3did | [ | | [ | 2012 |
| M | B100 | DB 3.0 | [ | Excl. | [ | 2011 |
| N | BM180 | PDB | [ | NMR; Divided into 4 sub-types | [ | 2005 |
| ∗ | S1 | PDB | [ | short; Excl. MBPs, NA; Disprot filtered | [ | 2009 |
| ∗ | S2 | PDB | [ | long; Excl. MBPs, NA; Disprot filtered | [ | 2009 |
| ∗ | DS24Carl | PDB | [ | | [ | 2008 |
The “Label” column defines the alphabetic character used to refer to the dataset in Table 4. “∗” in the “Label” column signifies that the set is not presented in Table 4 as it is not widely used. is the sequence identity redundancy cutoff, is the amino acid length of the chain, is the resolution cutoff in angstroms, N100 ≥ n requires the number of interface residues per 100 residues in a given protein to be greater than n, Ag-Ab refers to antigen-antibody complexes, M S A is the sequence identity redundancy cutoff for chains in an MSA, VS refers to Viral Subunits, NA refers to Nucleic Acids, N (x) refers to a set being non-homologous to set x, and MC denotes that both the monomer and the complex to which it belongs are known.
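The requirement on the number of interface residues per 100 residues can be made concrete with a short calculation. This is a minimal sketch with hypothetical function names. The comparison uses "greater than or equal", following the ≥ symbol in the footnote; that reading is an assumption, since the accompanying wording says "greater than".

```python
def interface_residues_per_100(n_interface: int, chain_length: int) -> float:
    """Number of interface residues per 100 residues of the chain."""
    if chain_length <= 0:
        raise ValueError("chain_length must be positive")
    return 100.0 * n_interface / chain_length


def meets_density_requirement(n_interface: int, chain_length: int, n: float) -> bool:
    """True if the chain has at least n interface residues per 100 residues.

    Assumption: the threshold is inclusive (>=).
    """
    return interface_residues_per_100(n_interface, chain_length) >= n
```

For instance, a 240-residue chain with 12 interface residues has 5 interface residues per 100 residues, so it would pass a threshold of n = 5 but not n = 6.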
Comparative evaluation of recent predictors
| Predictor | Year | Label | Dataset | | | | | | | Ref. |
|---|---|---|---|---|---|---|---|---|---|---|
| ProMate | 2004 | A | DB3-188 | 36.5 | 30.3 | 77.1 | 67.7 | 19.5 | 33.1 | [ |
| | | B | DS56B | 31.9 | 27.3 | 76.7 | 63.3 | 15.6 | 29.4 | |
| | | C | DS56U | 28.7 | 27.3 | 76.6 | 62.7 | 14.0 | 28.0 | |
| | | E | NI2 | 40.0 | 93.9 | ∗ | ∗ | 13.6 | 56.1 | [ |
| | | F | PlaneDimers | ∗ | ∗ | ∗ | 68.0 | 18.0 | ∗ | [ |
| | | G | TransComp_1 | ∗ | ∗ | ∗ | 70.0 | 20.0 | ∗ | |
| ConsPPISP | 2005 | A | DB3-188 | 46.5 | 30.6 | 80.4 | 73.2 | 26.7 | 36.9 | [ |
| | | B | DS56B | 39.8 | 36.1 | 78.9 | 72.6 | 25.2 | 37.9 | |
| | | C | DS56U | 37.4 | 34.5 | 79.5 | 71.2 | 23.8 | 35.9 | |
| | | E | NI2 | 49.3 | 32.2 | ∗ | ∗ | 14.7 | 39.0 | [ |
| PINUP | 2006 | A | DB3-188 | 40.7 | 34.7 | 78.3 | 66.0 | 24.6 | 37.5 | [ |
| | | B | DS56B | 37.3 | 31.9 | 78.4 | 63.7 | 21.7 | 34.4 | |
| | | C | DS56U | 30.4 | 30.1 | 76.9 | 60.0 | 16.4 | 30.2 | |
| | | E | NI2 | 52.9 | 28.5 | ∗ | ∗ | 15.1 | 37.0 | [ |
| WHISCY | 2006 | H | W025 | 39.0 | 27.0 | ∗ | ∗ | 27.0 | ∗ | [ |
| metaPPISP | 2007 | A | DB3-188 | 49.0 | 26.7 | 81.1 | 74.6 | 26.2 | 34.6 | [ |
| | | B | DS56B | 43.3 | 25.8 | 80.8 | 74.4 | 22.9 | 32.3 | |
| | | C | DS56U | 38.9 | 24.0 | 81.1 | 71.5 | 20.2 | 29.7 | |
| | | E | NI2 | 54.7 | 25.5 | ∗ | ∗ | 16.6 | 34.8 | [ |
| | | F | PlaneDimers | ∗ | ∗ | ∗ | 54.0 | 4.0 | ∗ | [ |
| | | G | TransComp_1 | ∗ | ∗ | ∗ | 78.0 | 31.0 | ∗ | |
| PIER | 2007 | E | NI2 | 44.1 | 83.6 | ∗ | ∗ | 23.0 | 57.7 | [ |
| SPPIDER | 2007 | J | S149 | 63.7 | 60.3 | ∗ | 76.0 | 42.0 | ∗ | [ |
| | | F | PlaneDimers | ∗ | ∗ | ∗ | 80.0 | 33.0 | ∗ | [ |
| | | G | TransComp_1 | ∗ | ∗ | ∗ | 68.0 | 15.0 | ∗ | |
| Sikic | 2009 | L | 3DS | 63.4 | 78.3 | 65.3 | ∗ | 30.8 | ∗ | [ |
| PredUs | 2010 | A | DB3-188 | 43.6 | 45.7 | ∗ | ∗ | ∗ | ∗ | [ |
| | | B | DS56B | 41.5 | 42.2 | ∗ | ∗ | ∗ | ∗ | |
| | | C | DS56U | 39.8 | 44.6 | ∗ | ∗ | ∗ | ∗ | |
| | 2011 | A | DB3-188 | 50.3 | 57.5 | 72.6 | 73.9 | 34.5 | 53.0 | [ |
| | | B | DS56B | 43.0 | 53.0 | 72.1 | 71.3 | 29.0 | 47.4 | |
| | | C | DS56U | 43.3 | 53.6 | 73.2 | 72.9 | 30.4 | 47.9 | |
| | | K | S58 | 45.5 | 57.6 | 78.5 | ∗ | 37.7 | 50.8 | [ |
| VORFFIP | 2011 | J | S149 | 63.4 | 74.7 | ∗ | 90.0 | 58.0 | ∗ | [ |
| | | H | W025 | 42.0 | 47.0 | ∗ | ∗ | 38.0 | ∗ | |
| | | M | B100 | 45.0 | 56.0 | ∗ | ∗ | 42.0 | 49.0 | |
| HomPPI | 2011 | N | BM180 1 | ∗ | 58.0 | 85.0 | ∗ | 44.0 | ∗ | [ |
| | | | BM180 2 | ∗ | 48.0 | 84.0 | ∗ | 42.0 | ∗ | |
| | | | BM180 3 | ∗ | 71.0 | 86.0 | ∗ | 60.0 | ∗ | |
| | | | BM180 4 | ∗ | 73.0 | 91.0 | ∗ | 65.0 | ∗ | |
| PrISE | 2012 | A | DB3-188 | 48.0 | 43.2 | 80.6 | 77.2 | 33.8 | 45.5 | [ |
| | | B | DS56B | 46.1 | 45.4 | 80.9 | 77.6 | 34.1 | 45.7 | |
| | | C | DS56U | 43.7 | 44.0 | 81.2 | 75.5 | 32.6 | 43.8 | |
| Li mRMR-IFS | 2012 | L | 3DS | 65.3 | 79.0 | 67.3 | ∗ | 34.8 | ∗ | [ |
| Chen PDM-ML | 2012 | I | S435 | 51.2 | 66.2 | 75.9 | ∗ | 42.0 | 57.8 | [ |
| | | J | S149 | 51.9 | 67.7 | 75.3 | ∗ | 42.3 | 58.8 | |
| | | K | S58 | 44.6 | 65.4 | 77.7 | ∗ | 40.3 | 53.0 | |
| PresCont | 2012 | F | PlaneDimers | ∗ | ∗ | ∗ | 80.0 | 33.0 | ∗ | [ |
| | | G | TransComp_1 | ∗ | ∗ | ∗ | 69.0 | 17.0 | ∗ | |
| RAD-T | 2014 | A | DB3-188 | 28.5 | 64.7 | 65.2 | ∗ | 22.2 | 35.5 | [ |
| | | D | NI1 | 33.8 | 80.5 | 51.8 | ∗ | 20.1 | 46.4 | |
| | | E | NI2 | 44.7 | 80.9 | 59.1 | ∗ | 26.4 | 57.6 | |
refers to the DB3-188 set excluding 2VIS; refers to the S435 set excluding 3 proteins due to obsolescence or absence; refers to the S149 set excluding 7 proteins due to existence in the training set. BM180 1 are transient enzyme-inhibitor complexes, BM180 2 are transient non-enzyme-inhibitor complexes, BM180 3 are obligate hetero-dimers, and BM180 4 are obligate homo-dimers.
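For the ProMate, ConsPPISP, and PrISE rows on DB3-188, the last numeric column matches the harmonic mean of the first two numeric columns, which is what an F1 score computed from precision and recall would give. The sketch below checks this for those three rows only. Treating the columns as precision, recall, and F1 is an assumption made here for illustration, and the function name is hypothetical; because the harmonic mean is symmetric, the check does not depend on which of the two columns is precision.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (first numeric column, second numeric column, last numeric column)
# copied from the DB3-188 rows of the table above.
db3_188_rows = {
    "ProMate": (36.5, 30.3, 33.1),
    "ConsPPISP": (46.5, 30.6, 36.9),
    "PrISE": (48.0, 43.2, 45.5),
}

for predictor, (col_a, col_b, reported) in db3_188_rows.items():
    computed = round(f1_score(col_a, col_b), 1)
    print(f"{predictor}: computed {computed}, reported {reported}")
```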
Compilation of selected software/methods used to compute features described in the Features section, including the predictors utilizing each and the publications describing their utility and recommending their use
| Feature | Software/Method | Predictor | Publication |
|---|---|---|---|
| Accessible Surface Area | PSAIA [ | Li [ | Chen [ |
| Conservation | HSSP [ | Li [ | Zhou & Shan [ |
| Depth Index | DPX [ | Sikic [ | Sikic [ |
| Protrusion | CX [ | Sikic [ | Jones & Thornton [ |
| Hydrophobicity | PSAIA [ | Sikic [ | Neuvirth [ |
| Secondary Structure | DSSP [ | Sikic [ | Neuvirth [ |
| Propensity | Dong [ | Dong [ | Conte [ |
| Disorder | VSL2 [ | Li [ | Wright [ |
| Curvature | Coleman method [ | Li [ | Jones & Thornton [ |
| B-Factors | Curated from PDB [ | RAD-T [ | Ezkurdia [ |
| Electrostatic Potential | APBS [ | RAD-T [ | RAD-T [ |
| Side-chain Conformational Entropy | FoldX [ | VORFFIP [ | Cole & Warwicker [ |
| Residue Contact Frequencies | PredUs [ | PredUs [ | PredUs [ |
| Atomic Probability Density Map Features | Yu [ | Chen [ | Chen [ |
| Energy of Solvation | Fernandez-Recio method [ | Fiorucci [ | Fiorucci [ |