| Literature DB >> 31199455 |
Cunliang Geng, Yong Jung, Nicolas Renaud, Vasant Honavar, Alexandre M J J Bonvin, Li C Xue.
Abstract
MOTIVATION: Protein complexes play critical roles in many aspects of biological function. Three-dimensional (3D) structures of protein complexes are critical for gaining insights into the structural basis of interactions and their roles in the biomolecular pathways that orchestrate key cellular processes. Because of the expense and effort associated with experimental determination of 3D protein complex structures, computational docking has evolved into a valuable tool for predicting the 3D structures of biomolecular complexes. Despite recent progress, reliably distinguishing near-native docking conformations from a large number of candidate conformations, the so-called scoring problem, remains a major challenge.
Year: 2020 PMID: 31199455 PMCID: PMC6956772 DOI: 10.1093/bioinformatics/btz496
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Schematic workflow of our graph kernel-based scoring method. Docking models for a protein–protein complex are first represented as graphs by treating the interface residues as graph nodes and the intermolecular contacts they form as graph edges. Interface features are added to the graph as node or edge labels (only PSSM profiles as node labels in this case). Then, each of the interface graphs of the docking models is compared to the interface graphs of both the positive (native) structure and negative (non-native) models. This graph comparison generates a similarity matrix for the docking models, with the number of rows and columns corresponding to the number of docking models and the total number of positive and negative graphs, respectively. Next, the support vector machine takes the graph kernel matrix as input and predicts decision values that are used as the GraphRank score. The final scoring function iScore is a linear combination of the GraphRank score and HADDOCK energetic terms (van der Waals, electrostatic and desolvation energies). The weights of this linear combination are optimized using a genetic algorithm (GA) over the BM4 HADDOCK dataset
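The workflow in Figure 1 can be sketched in a few lines of Python. This is a minimal, illustrative sketch only: the helper names are hypothetical, a toy label-histogram similarity stands in for the random-walk graph kernel, a similarity difference stands in for the SVM decision value, and the fixed weights stand in for the GA-optimized combination described in the paper.

```python
from collections import Counter

def interface_graph(contacts, node_labels):
    """Interface residues become nodes, intermolecular contacts become edges.
    contacts: iterable of (residue_i, residue_j) pairs across the interface."""
    nodes = {r: node_labels[r] for pair in contacts for r in pair}
    edges = {frozenset(pair) for pair in contacts}
    return nodes, edges

def graph_kernel(g1, g2):
    """Toy graph similarity: overlap of node-label histograms plus shared edges.
    A crude stand-in for the random-walk kernel used by GraphRank."""
    (n1, e1), (n2, e2) = g1, g2
    h1, h2 = Counter(n1.values()), Counter(n2.values())
    node_sim = sum(min(h1[label], h2[label]) for label in h1)
    edge_sim = len(e1 & e2)
    return node_sim + edge_sim

def graphrank_score(model, positives, negatives):
    """Stand-in for the SVM decision value: mean similarity to native
    interfaces minus mean similarity to non-native ones."""
    pos = sum(graph_kernel(model, g) for g in positives) / len(positives)
    neg = sum(graph_kernel(model, g) for g in negatives) / len(negatives)
    return pos - neg

def iscore(graphrank, evdw, eelec, edesolv, w=(1.0, 0.01, 0.01, 0.01)):
    """Linear combination of GraphRank and HADDOCK energy terms.
    The weights here are placeholders; the paper fits them with a GA."""
    return w[0] * graphrank + w[1] * evdw + w[2] * eelec + w[3] * edesolv
```

A model whose interface graph resembles the native structure more than the non-native decoys receives a higher GraphRank score, which is then blended with the energetic terms to give the final iScore.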
Rankings of GraphRank and iScore in comparison with the scorer groups on the CAPRI score set
| Scoring function/group | Performance | # Submitted targets |
|---|---|---|
| iScore | 9/2***/5** | 13 |
| Weng | 8/3***/2** | 9 |
| Bonvin | 8/2***/3** | 9 |
| Bates | 8/1***/4** | 10 |
| GraphRank | 8/1***/4** | 13 |
| Zou | 7/4***/1** | 9 |
| Wang | 6/2***/3** | 6 |
| Fernandez-Recio | 5/2***/3** | 8 |
| Elber | 5/1***/1** | 5 |
| Wolfson | 4/1*** | 5 |
| Camacho | 3/2***/1** | 5 |
| … and many others | | |
Note: In total, 37 scorer groups were assessed (Supplementary Table S3), but only scorer groups that submitted predictions for at least 5 of the 13 CAPRI targets are shown here. The scoring functions/groups are ordered by performance. The number of targets with submitted predictions is shown for each function/group.
Fig. 2. Success rate of the HADDOCK score, GraphRank and iScore on the BM4 HADDOCK training dataset over the top N clusters of models
Fig. 3. Success rates measured at cluster level on four sets of docking program-specific models for the newly added protein–protein complexes in BM5. GraphRank and iScore are compared with scoring functions from HADDOCK (A), SwarmDock (B), pyDock (C) and ZDock (D) on the docking models of the corresponding docking program, respectively
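The success-rate metric plotted in Figures 2 and 3 is simple to state: a target counts as a success if at least one acceptable-or-better model appears among its top N ranked entries. A minimal sketch, assuming the usual CAPRI quality encoding (0 = incorrect, 1 = acceptable, 2 = medium, 3 = high); the function name is illustrative, not from the paper's code:

```python
def success_rate(ranked_quality_per_target, n):
    """Fraction of targets with at least one acceptable-or-better model
    (quality >= 1) among the top-n ranked models or clusters.

    ranked_quality_per_target: one list of quality codes per target,
    ordered by the scoring function's ranking."""
    hits = sum(any(q >= 1 for q in ranked[:n])
               for ranked in ranked_quality_per_target)
    return hits / len(ranked_quality_per_target)
```

Sweeping n from 1 upward yields the success-rate curves over the top N clusters shown in the figures.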
Comparison of GraphRank and iScore with IRaPPA on docking program-specific models of BM5 protein–protein complexes
| Docking models | #Complexes | GraphRank | iScore | IRaPPA |
|---|---|---|---|---|
| SwarmDock | 18 | 7/1***/6** | 10/2***/6** | 12/1***/6** |
| pyDock | 14 | 5/3** | 6/3** | 10/3** |
| ZDock | 10 | 4/3** | 6/5** | 8/3** |
Note: The top 10 models are selected and evaluated for each complex. The scoring performance for each complex is reported as the number of complexes with acceptable or better models (hits), followed by the number with high-quality (indicated with ***) or medium-quality (**) models. The overall performance of each method on all complexes is reported here. For example, 7/1***/6** means that a scoring function is successful on 7 complexes; of those 7, 1 has at least a *** model and 6 have at least a ** model in the top 10.
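The compact hits/***/** notation used in these tables can be reproduced from the best model quality each complex achieves in the top 10. A minimal sketch, assuming the mutually exclusive reading suggested by the counts (*** and ** tally complexes whose best top-10 model is high or medium quality, respectively); the function name and quality encoding (0 = incorrect, 1 = acceptable, 2 = medium, 3 = high) are illustrative:

```python
def capri_summary(best_quality_per_complex):
    """Format per-complex best top-10 model quality in the tables'
    hits/#high***/#medium** notation."""
    hits = sum(q >= 1 for q in best_quality_per_complex)   # acceptable or better
    high = sum(q == 3 for q in best_quality_per_complex)   # best model is ***
    medium = sum(q == 2 for q in best_quality_per_complex) # best model is **
    parts = [str(hits)]
    if high:
        parts.append(f"{high}***")
    if medium:
        parts.append(f"{medium}**")
    return "/".join(parts)
```

For instance, seven hits of which one is high quality and six are medium quality would be rendered as 7/1***/6**, matching the SwarmDock/GraphRank entry above.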
Comparison of GraphRank and iScore with CAPRI best performing group per target on the CAPRI score set
| CAPRI targets | GraphRank | iScore | CAPRI best | # Total models | #Near-native |
|---|---|---|---|---|---|
| T29 | 4 | 4 | 9/5** | 1979 | 166 |
| T30 | 0 | 0 | 0 | 1148 | 2 |
| T32 | 4/1** | 4/1** | 2 | 599 | 15 |
| T35 | 0 | 0 | 1 | 497 | 3 |
| T37 | 2/1** | 4/2** | 6/1*** | 1364 | 97 |
| T39 | 0 | 0 | 0 | 1295 | 4 |
| T40 | 4/3** | 4/1*** | 10/10*** | 1987 | 535 |
| T41 | 8 | 10/2** | 10/2*** | 1101 | 347 |
| T46 | 3 | 4 | 4 | 1570 | 24 |
| T47 | 8/5***/3** | 10/6***/4** | 10/10*** | 1015 | 608 |
| T50 | 0 | 4/3** | 7/6** | 1447 | 133 |
| T53 | 5/1** | 5/1** | 8/3** | 1360 | 122 |
| T54 | 0 | 0 | 0 | 1304 | 19 |
| Total | 8/1***/4** | 9/2***/5** | 10/4***/3** | | |
Note: The top 10 models are selected and evaluated for each target. Values are labeled in green/red (in the original table) when the performance of our scoring functions is better/worse than that of the best CAPRI scoring group. The scoring performance for each target is reported as the number of acceptable or better models (hits), followed by the number of high-quality (indicated with ***) or medium-quality (**) models. For example, 8/2** means that there are 8 hits in total among the top 10 models, 2 of which are medium-quality models. The overall performance of each method on all 13 targets (the last row) is reported in a similar way. For example, 9/2***/5** means that a scoring function is successful on 9 targets; of those 9, 2 have at least a *** model and 5 have at least a ** model in the top 10. Note that the CAPRI best column consists of results from 37 different groups (refer to Table 3 for a comparison of the performance per group and Supplementary Table S3 per target).