Literature DB >> 21918611

The development of an affinity evaluation and prediction system by using protein-protein docking simulations and parameter tuning.

Koki Tsukamoto¹, Tatsuya Yoshikawa, Kiyonobu Yokota, Yuichiro Hourai, Kazuhiko Fukui.

Abstract

A system was developed to evaluate and predict the interaction between protein pairs by using the widely used shape complementarity search method as the algorithm for docking simulations between the proteins. We used this system, which we call the affinity evaluation and prediction (AEP) system, to evaluate the interaction between 20 protein pairs. The system first executes a "round robin" shape complementarity search of the target protein group, and evaluates the interaction between the complex structures obtained by the search. These complex structures are selected by using a statistical procedure that we developed called 'grouping'. At a prevalence of 5.0%, our AEP system predicted protein-protein interactions with a 50.0% recall, 55.6% precision, 95.5% accuracy, and an F-measure of 0.526. By optimizing the grouping process, our AEP system successfully predicted 10 protein pairs (among 20 pairs) that were biologically relevant combinations. Our ultimate goal is to construct an affinity database that will provide cell biologists and drug designers with crucial information obtained using our AEP system.

Entities: Chemical Disease Gene Species

Keywords: FFT; affinity analysis; massive parallel computing; protein-protein docking; protein-protein interaction

Year: 2009 PMID： 21918611 PMCID： PMC3169950 DOI： 10.2147/aabc.s3646

Source DB: PubMed Journal: Adv Appl Bioinform Chem ISSN： 1178-6949

Introduction

Proteins have diverse functions ranging from molecular machines to signaling. They participate in catalytic reactions, provide transport, form viral capsids, traverse membranes and construct regulated channels, transmit information from DNA to RNA, thus making it possible to synthesize new proteins. Given their importance, considerable effort has centered on the prediction of protein functions. A major way of achieving this is through the identification of binding partners. If we know the function of at least one of the components with which the protein interacts, that should enable us to determine its function(s) and the pathway(s) in which it plays a role. This holds since the vast majority of the functions performed by proteins in living cells involve protein–protein interactions (PPI).1,2 Hence, by observing the intricate network of these interactions we can map cellular pathways, their interconnectivities and their dynamic regulation. Their identification is at the heart of functional genomics; their prediction is crucial to the discovery of new drugs. Knowledge of these pathways, their topologies, lengths, and dynamics may provide useful information for forecasting side effects. The PPI affinity problem involves finding a protein pair that is interactive from a protein group with a known structure.1,2 Recently, pharmaceutical companies have been trying to produce medicines based on candidates discovered using protein– chemical compound docking simulations. Docking analysis with respect to PPI remains limited, however, because the macromolecules involved in these simulations are large and thus require extensive computational resources. PPI is a basic element of model construction in research that first analyzes the cell-signaling pathway and then identifies the target protein of a drug. The prediction of the interactions between vast numbers of proteins is crucial if we are to understand such biological phenomena as cellular signaling, enzyme reaction and gene expression regulation. PPI prediction has been extensively studied both experimentally3–7 and theoretically. In experimental studies, Giot and colleagues,8 and Stanyon and colleagues9 constructed a protein interaction map on a genome scale for Drosophila melanogaster, and Rain and colleagues10 constructed that for human Helicobacter pylori. Based on theoretical calculations, Wojcik and colleagues11 constructed an interaction map by using domain profile pairs in which the input was the amino acid sequence for H. pylori. These studies mainly used either genome scale analysis8,10 or sequence analysis.12 However, in vivo, PPI is achieved by using protein structural information. Therefore, ultimately, PPI evaluation and prediction should also utilize protein structural information. Such an interaction evaluation and prediction system has been discussed by Smith and colleagues,13 they did not consider the composition, performance or accuracy of the evaluation or the prediction of a concrete system. If we can perform such an ultimate PPI evaluation and prediction of the interaction between pairs of a large number of proteins, we can assume a signal transmission pathway, and then we can evaluate and predict the affinity between an enzyme and inhibitor(s) or substrate(s), an antibody and an antigen. Such a large-scale PPI affinity evaluation system has been discussed by Park and colleagues14 and Gong and colleagues.15 Their system, Protein Structural Interactome map (PSIMAP) (http://psimap.com/), performs evaluations by using Euclidean distance between two domains and the number of residue pairs within a given distance threshold as the PPI measurement base. This method is simple and fast. We know that the data provided by PSIMAP are useful, but such a method based on simply distance comparisons is not available for making PPI maps that include sufficient three-dimensional (3D) structural information for supporting a drug discovery. Therefore, in this study we developed a high-speed and more detailed PPI affinity evaluation and prediction system that is based on our original docking simulation program and the original high-speed FFT library produced by Hourai and colleagues16 and can be performed using a massive parallel PC cluster. In this system, which we call an affinity evaluation and prediction (AEP) system, first a “round robin” shape complementarity search is executed for the target protein group, and then the interaction between the proteins is evaluated based on complex structures determined by using a statistical procedure that we developed called ‘grouping’.

Methods

Evaluation of shape complementarity

When the shape complementarity17–19 between two proteins is evaluated, the structure of a protein is converted into a rigid body object represented in a 3D grid space. Typically, in the protein–protein docking simulation, each point of the protein represented on the grid is classified as being in the core area, the surface area, or the cavity area as described below. Because the accuracy of docking simulations that use the shape complementarity search depends strongly on the representation method, various representation methods have been developed. Here, we have developed a representation method called the voxel model, which has two distinguishing features. (1) The molecular surface can be faithfully modeled by changing the radius of the probe sphere to define as the extended solvent-accessible surface, which is traced out by the probe sphere center as it rolls over the protein, of each atomic type. (2) The shape complementarity can be determined much more accurately than just made voxel models by reducing the thickness of the molecular surface. However, when the thickness is reduced too much, the response of the shape complementarity evaluation function becomes too sensitive. Next we describe in detail the voxel model and the procedure for converting the crystal structure of a protein to the voxel model (voxelization).

Voxelization

The voxel model is expressed as shown in expression (1), where VOXEL (V) represents the voxel model, INNER (I) defines the area on the atom, where the protein exists, OUTER (O) is the area that is not INNER, in other words it indicates a solvent. SURFACE (S) is the area on the molecular surface of the protein, CAVITY (C) is the area inside SURFACE that is not INNER, namely, it means a cavity. N is a set of natural numbers, and φ is an empty set. n, n, n, and n are the upper bound values of each voxel type (I, O, S, C) decided from the molecular size of the protein. All the areas in the voxel space in the initial state are defined as DEFAULT (D), and a temporary molecular surface of protein is defined as TMP_SURFACE (S). The procedure of voxelization is formulated using expression (2). Figure 1 details each step of this procedure.

Figure 1

Six-step voxelization process for the quantization and characterization of protein surfaces in our AEP system.

Abbreviation: AEP, affinity evaluation and prediction.

Step 1 (DEFAULT): In the voxel space, all grid points are assigned to DEFAULT as the initial state. Step 2 (INNER): The grid points that exist within a spheroid of radius R (=RΔr) are assigned to INNER, where R is the van der Waals radius of each atom, and Δr is the increment in the radius of a defined type of atom. Step 3 (OUTER): All the remaining grid points are assigned as OUTER. Step 4 (TMP_SURFACE): The INNER grid points that adjoin the OUTER grid points are newly assigned to TMP_SURFACE. Step 5 (SURFACE): A TMP_SURFACE grid point is finally replaced by SURFACE or INNER according to the following process. When a TMP_SURFACE grid point is enclosed by six or more OUTER grid points, the TMP_SURFACE grid point is re-assigned to SURFACE. After this process has been completed for all TMP_SURFACE grid points, the remaining TMP_SURFACE grid points are re-assigned to INNER. Thus, finally, the TMP_SURFACE grid points vanish from the voxel space. Step 6 (CAVITY): The grid points which remain as DEFAULT are re-assigned to CAVITY. Hence, the DEFAULT grid points also disappear from the voxel space.

Score function of contact surface area

In the shape complementarity evaluation, the surface contact between proteins represented by the grid points is evaluated by using the score function of the contact surface area.17 This score function defines a score for INNER, OUTER, SURFACE, and CAVITY as described in Voxelization. The contact surface score function we adopted for our voxel model is the pair-wise shape complementarity (PSC) score function S that Weng and colleagues17 defined with ZDOCK.17,18,20 As defined in expressions (3) and (4), this S is a primitive function where the electrostatic interaction is not considered. The score function S consists of two factors R, L. The factor R(l, m, n) is for the receptor protein, which is a protein fixed on the grid space, and the factor L(l, m, n) is for the probe protein, which is a protein that is rotated and translated on the grid space. In expressions (3) and (4), l, m, and n are the coordinates of the grid point, and “Re[]” and “Im[]”, respectively, indicate the real and imaginary parts of the factor and the PSC scoring function. Moreover, bonusR, which appears in the real part of R(l, m, n), indicates “bonus” points (ie, additional points added to the PSC scoring) when the probe protein penetrates the pocket structure of the receptor protein. The final S, as shown by expression (4), is obtained by calculating the correlation function between R and L. The computational complexity of this calculation is O(N), but can be decreased to O(NlogN) by using a fast Fourier transform (FFT) as follows. Abbreviations: (DFT, Discrete Fourier Transform; IFT, Inverse Fourier transform). In the S calculation, we adopted the FFT library CONV3D developed by Hourai and Nukada.16 The CONV3D is highly optimized for specific CPU architectures (Pentium, Xeon, Athlon, POWER5, etc.), achieves high-speed performance, and is about three times faster than the widely used FFT library package FFTW21,22 (according to the results obtained with Protein–Protein Docking Benchmark 2.0;23 data not shown). At the beginning, a rotational search is performed in an actual evaluation of the contact surface between proteins. The rotational search works by explicitly rotating the ligand protein around each of its three Cartesian angles by a certain increment (15 degrees in this study). A total of 3600 rotated bodies of the ligand protein are thus generated. Next, the shape complementarity evaluation is undertaken for all rotated bodies. If the grid size of the ligand protein is N3, then N3 × 3600 PSC scores must be calculated. Based on preliminary experiments, in all the analyses we only extracted the top-ranked 512 PSC scores for the following procedures.

Reliability of shape complementarity evaluation

To confirm the reliability of the shape complementarity evaluation by using our voxel model, we performed simple docking calculations using some bound structures adopted from Weng’s Protein–Protein Docking Benchmark 2.0 as samples. It is widely recognized that the main challenge in protein docking lies in the unbound–bound transition. Although we are also more interested in the unbound experiments than in the bound, the latter enables us to primitively confirm the advantages of our AEP’s improvement at the initial stage. Because a protein complex for the latter, which is pulled apart and reassembled, is more complementary than individually crystallized component structures for the former. The widely used i-RMSD (backbone RMSD at the interface) was the criterion we used to evaluate the reliability. In this way we evaluated the difference between the complex structure obtained by using the docking calculation and the X-ray crystal structure.

Affinity analysis

To calculate protein-protein affinity, we assess the biological relevance of target proteins by statistically analyzing the characteristics of shape complementarity. The affinity analysis involved processing the 512 top-ranked values of S obtained from the shape complementarity evaluation by using a clustering method that we call grouping. The following describes the affinity prediction procedure and includes a description of the grouping method. Step 1 (Shape complementarity evaluation): A shape complementarity evaluation of protein pairs is performed. Only the 512 top-ranked S scores are then considered for step 2. Step 2 (Grouping): The purpose of step 2 is to classify the 512 S scores into 10 clusters, C (i = 1–10). First, the clustering method is performed hierarchically by using the nearest neighbor method for the 512 values of S as follows. We define the targeted dataset X = {x1, ..., x512} as consisting of 512 S scores from step 1. For each data point, x is represented as x = (S, T(T, T, T), R(Rθ, R, R)). Here, S = S from expression (4), T(T, T, T) is the amount of translation for the ligand protein in the grid space, and R(Rθ, R, R) is a rotation parameter for the ligand protein. If (o,p,q) in expression (4) is used, then o = T, p = T, and q = T in T(T, T, T), respectively. All data x belonging to the dataset X are sorted by the rank order of S. In X, we define the data point with the maximum S (maxS) as the central data y1 of cluster 1 (y1 ∈C1). Here, the term central data indicates data that represent a characteristic of a cluster. Now, we introduce the distances D1(y1, x) and D2(y1, x) between y1 and x. Here, in this dataset x, the data y1 is not included and is not classified into any clusters C (x ∈ (C)C). The distance D1 is represented as the Euclidean translation distance, and the rotational distance D2 is a value defined to represent the difference in rotation of data x and data y1. Here, we define the center of gravity of a ligand as point G and the coordinates of the Cα atom in the C- and N-terminals of a ligand as points C and N, respectively. We also define vector u as the normal vector of a plane, which is defined by three points (G, C, and N). Thus, the rotational difference D2 indicates the cosine between u of data x and u1 of data y1. When the following two conditions exist between the above data x (ie, x ∈ (C)C) and the central data y1, the data x are classified into cluster 1 (C1). When the translation distance |D1| between T of data x and T1 of the central data y1 is less than a certain translational threshold (eg, 10.0Å). When the rotational distance |D2| between x and y1 is less than a certain rotational threshold (eg, π/6). Cluster C1 is determined by repeating step 2 for all x where y1 is excluded. Similarly, step 2 is repeated for clusters C (i = 2–10). Step 2 ends either when the ten clusters (C1–C10) are generated or when all x belong to any cluster C. Step 3 (Calculation of Z-score): The purpose of step 3 is to prepare for the affinity analysis. The Z-score indicates the number of standard deviations that an observation is above or below the mean. It allows us to compare observations from different normal distributions. Here, a set of central data y of each cluster C obtained in step 2 is defined as the central dataset Y. The distribution Z(y) is the Z-score distribution of the PSC score of each set of central data y ∈ Y when the dataset X is considered a population. The distribution Z(C) is the distribution of the Z-score of m. Here m is the number of data that constitute cluster C when the dataset M = {m | i ∈ N, i ≧0} is considered a population. In step 3, the distributions Z(y) and Z(C) are calculated. Step 4 (Calculation of affinity score S): The measure for the affinity analysis is called an affinity score S, which is defined in expression (5) as the maximum value of a linear combination of Z(y) and Z(C) and their corresponding weighting factors w and w. The weighting factors w and w are introduced to determine the balance between the S score distribution Z(y) and the conformation distribution Z(C) when the affinity score S is calculated. The analysis of the affinity between protein pairs can be quantified by using S. Step 5: Steps 1 to 4 are repeated. For example, if the affinity analysis is needed for 20 receptors and 20 ligands (20 × 20), then steps 1 to 4 are repeated 20 times because there are 20 ligands for each receptor. The S matrix is obtained by repeating steps 1–5 for all protein pairs. When several target proteins have an unknown affinity, we can consider our AEP system to be an affinity prediction system. To evaluate the prediction performance of our AEP system, we used receiver operating characteristics curve (ROC) analysis. Based on the ROC analysis, we evaluated the maximum prediction performance of our AEP system by using a recall, a precision and a cutoff value. The parameters of w and w in expression (5) were defined at the maximum prediction performance.

Performance measures

Sensitivity and specificity are one of the performance measures for assessing the statistical significance of AEP’s predictions. The sensitivity or the recall measures the percentage of biologically relevant pairs which are correctly identified as having the high-affinity scores; and the specificity measures the percentage of biologically not relevant combinations which are identified as having the low-affinity. Here, the evaluation of each affinity scores (ie, high or low) is determined by a cutoff value from ROC analysis. Then, precision is the number of predictions statistically identified as belonging to the protein pairs divided by the total number of pairs. “Accuracy” is closely related to precision, measures the percentage of correctly identified protein pairs in the total number of pairs. Moreover, prevalence is defined as the total number of biologically relevant pairs in all the pairs. F-measure, which is the weighted harmonic mean of precision and recall, can be used as a single measure of performance.

Implementation

Our AEP system was implemented using C language, which we ran on the cluster PC (16 cores and 64 GB memory). To improve the performance, we parallelized the core program of AEP using the Message-Passing Interface (MPI) library. For example, the time needed for the docking calculation is about 12 seconds for a grid size of 1283 with a rotation resolution of 15 degrees. Thus, the peak performance of AEP is 7,200 protein–protein pairs per day.

Protein pair dataset

All protein pairs used in this study were selected manually from complex structures registered with the Protein Data Bank (PDB). We adopted the 20 protein pairs (bound dataset) shown in Table 1, and analyzed the affinity of each pair.

Table 1

Targeted 20 protein pairs

Protein Data Bank-ID	Index	Receptor	Ligand
1AY7	0	Ribonuclease SA	BARNase inhibitor
1BPL	1	GlucanoHydrolase:A	GlucanoHydrolase:B
1CAU	2	Canavalin:A	Canavalin:B
1D4V	3	TNF-related Apoposis inducing Ligand	Death receptor 5
1EAY	4	Chey:A,B	Chey-binding(P2) domain
1EM8	5	DNA polymerase II c subuinit	DNA polymerase II y subunit
1F2T	6	RAD50 ABC-ATPase N-term. domain	RAD50 ABC-ATPase C-term. domain
1F60	7	Elongation Factor EEF1A	Elongation Factor EEF1BA
1GO3	8	DNA-directed RNA polymerase subunit E	DNA-Directed RNA polymerase subunit F
1KA9	9	Imidazole glycerol phosphtate synthase:H	Imidazole glycerol phosphtate synthase:F
1RKE	10	Vinculin	VCL protein
1VET	11	Mitogen-activated protein kinase 1 interacting protein 1	Late endosomal/lysosomal Mp1 interacting protein
1WMH	12	Protein kinase C, iota type	Partitioning defective-6 homolog alpha
1WRD	13	Target of Myb protein 1	Ubiquitin
1XG2	14	Pectinesterase 1	Pectinesterase inhibitor
1Y96	15	Gem-associated protein 6	Gem-associated protein 7
2F9Z	16	Chemotaxis protein CheC	Chemotaxis methylation protein
2G2U	17	B-lactamase SHV-1	Beta-lactamase inhibitory protein
2HSN	18	Methionyl-tRNA synthetase	GU4 nucleic-binding protein 1
2NXN	19	Ribosomal protein L11 methyltransferase	50S ribosomal protein L11

In a docking simulation that uses a general shape complementarity search, the receptor is fixed to the grid space and the ligand is rotated and translated in the grid space.

Results and discussion

Validation of docking accuracy using our AEP system

We selected seven protein pairs from Protein–Protein Docking Benchmark 2.0 according to the distribution of difficulty (“moderate”, “medium”, and “difficult”) that Weng and colleagues defined. The breakdown of the difficulty in the seven selected pairs was five pairs in the moderate difficultly class (1DFJ24, 1F3425, 1HIA26, 1AKJ27, 1F5128), one pair (1I2M29) in the medium difficultly class and one pair (1EER30) in the difficult class. Table 2 summarizes the docking calculation results for seven protein pairs in which i-RMSD was used to confirm the reliability of the shape complementarity evaluation. For all seven pairs, i-RMSD < 1.10 Å, indicating enough accuracy for docking prediction. As regards 1F34, 1HIA, 1EER, 1I2M, and 1F51, the rank that gives these highly accurate results is within the top 20 ranks. On the other hand, the S for 1DFJ and 1AKJ is relatively low, and therefore their ranks are also low, namely 129th and 230th, respectively. However, the affinity analysis successfully predicted the active sites of these pairs, including 1DFJ and 1AKJ. Although a general docking simulation uses a more complex evaluation function, the results obtained here by using a primitive evaluation function consisting only of the shape complementarity were good. Based on these docking results (Table 2), the shape complementarity evaluation is sufficiently reliable for affinity analysis.

Table 2

Docking calculation results to validate the shape complementarity evaluation using voxel model

Complex (PDB-ID)	Description	i-RMSD (angstrom)	Rank
1DFJ	Ribonuclease A/RNase inhibitor	0.60	129
1F34	Porcine pepsin/Ascaris inhibitor-3	0.81	1
1HIA	Kallikrein/Hirustatin	0.76	7
1AKJ	MHC Class 1 HLA-A2/T-cell CD8 coreceptor	0.61	230
1EER	Erythropoietin/EPO receptor	1.02	9
1F51	Sporulation response factor B/Sporulation response factor F	0.73	20
1I2M	Ran GTPase/RCC1	1.09	5

Abbreviations: i-RMSD, backbone root mean square at the interface; PDB-ID, Protein Data Bank-ID.

Shape complementarity evaluation

Prediction performance comparison between AEP and ZDOCK 3.0.1

As mentioned in Evaluation of shape complementarity, an original docking simulation code is included in our AEP system. We evaluated the prediction performance of the system when replacing our code with ZDOCK 3.0.1,31 which is a widely used docking program, and compared the prediction accuracy of the ZDOCK 3.0.1 version with that of the original AEP. It is the purpose of the comparison to show the effect of voxelization, grouping, and parallelization, which we have introduced. We have already confirmed this effect. For this comparison, we adopted 20 different protein pairs (bound dataset) from the Protein–Protein Docking Benchmark 2.0. The highest score of ZDOCK 3.0.1 output was adopted as the affinity score for each protein pair. The prediction performance obtained using ZDOCK 3.0.1 is shown in Table 3. This table represents the advantages of performance parameters of AEP over ZDOCK 3.0.1. Here, these parameters are of the peak performance by using the chosen cutoff from the ROC analysis. The F-measure of AEP is 5.1% more accurate than that of ZDOCK 3.0.1. That means the high specificity of AEP over ZDOCK 3.0.1 while the difference of sensitivity is slightly 10.0% (ie, the difference of the value of true positives is only 2). AEP has an ability to correctly identify many true negatives (ie, TN = 307) with few false positives (ie, FP = 73) rather than ZDOCK 3.0.1 (ie, TN = 260, FP = 120). The ability to eliminate false positives is very important in terms of our goal that tries to deeply penetrate protein–protein networks. It is significantly valuable not to predict nonbonding pairs as well as identifying correct combinations.

Table 3

Comparison prediction performance between AEP and ZDOCK

Parameters	ZDOCK (grid size = 1.2)	AEP (grid size = 1.2)
F-measure	0.194	0.245
Accuracy	68.8	80.0
Sensitivity (Recall)	75.0	65.0
Specificity	68.4	80.8
Precision	11.1	15.1
AUC	0.776	0.744

Abbreviation: AEP, affinity evaluation and prediction; AUC, area under the curve.

Moreover, our AEP required much less computational time than ZDOCK 3.0.1. ZDOCK 3.0.1 needs over 3600 seconds for one protein pair whereas our AEP system needs only 12 seconds, namely it is 300 times faster. As shown by these results, we have revealed the advantage of our AEP with respect to prediction and computational performance.

The effect of grid size on the identification capability of our affinity analysis

The accuracy of the representation of the protein model is dominated by the grid-size of the voxelization procedure. In shape complementarity docking, the optimum grid size is defined as being 1.2 Å by Weng and colleagues17 as regards accuracy of docking, although, for our affinity analysis, which is derived from that shape complementarity docking scheme, the optimum grid size is unclear. Thus, we first investigated an optimal grid size for our affinity analysis. In this investigation using 20 × 20 protein pairs as shown in Table 1, the grid size ranged from 0.8 to 2.2 Å at 0.2 Å intervals. As the standard for our evaluation of the accuracy of the affinity analysis we used F-measure, which we obtained as the cutoff value from the ROC analysis. The results of F-measure are shown in Table 4. According to these results, the highest F-measure was derived with a grid size of 1.2 Å. Thus, we use this grid size for all analyses as mentioned below.

Table 4

Evaluations about the prediction performance for various of grid size

Grid size	0.8	1.0	1.2	1.4	1.6	1.8	2.0	2.2
F-measure	0.129	0.132	0.526	0.120	0.286	0.400	0.103	0.161
Accuracy	49.5	44.3	95.5	56.0	80.0	92.5	17.3	87.0
Sensitivity (Recall)	75.0	85.0	50.0	60.0	80.0	50.0	95.0	25.0
Specificity	48.2	42.1	97.9	55.8	80.0	94.7	13.2	90.3
Precision	7.1	7.2	55.6	6.7	17.4	33.3	5.4	11.9
AUC	0.580	0.654	0.779	0.560	0.800	0.718	0.488	0.479
Cutoff value	4.340	4.770	7.540	5.530	6.120	7.220	4.300	6.410

Abbreviation: AUC, area under the curve.

Evaluation for 20 × 20 protein pairs

Figure 2 shows the results of the shape complementarity evaluation for about 20 × 20 protein pairs. It also shows the normalized S for the receptor of each ligand. The vertical axis shows the normalized S where the maximum value is 1.0, and the horizontal axis shows the PDB index of the receptors (The relation between the index and PDB-ID is shown in Table 1).

Figure 2

Normalized PSC score S of the receptor of each ligand. The horizontal axis shows the PDB index as in Table 1, and the vertical axis shows the normalized PSC score S.

Abbreviations: PDB, Protein Data Bank; PSC, pair-wise shape complementarity.

The polygonal lines of 12 out of 20 receptors have no characteristic peak. For these receptors, the score of each predicted ligand is not the maximum (ie, S < 1.0) or the number of ligands with maximal score is not only one: 1AY732(index = 0), 1CAU33(2), 1D4V34(3), 1EAY35(4), 1EM836(5), 1VET37(11), 1WMH38(12), 1WRD39(13), 1XG240(14), 2F9Z41(16), 2HSN42(18), and 2NXN43(19). Thus, when using only the normalized S, it is too difficult to evaluate and quantify the affinity between ligand and receptor. Using our affinity calculation, Figures 3 and 4 show the affinity score S with w = 0 and w = 1. These values for w and w (parameters used for calculating affinity scores) were chosen to provide the maximum prediction performance (ie, F-measure) from our preliminary experiments.

Figure 3

Affinity score S of the receptor of each ligand. The cutoff value is 7.54σ. The filled arrowhead (▾) denotes a protein pair in which S ≥ 7.54σ and is thus biologically relevant; the unfilled arrowhead is a protein pair incorrectly identified as being biologically relevant (ie, false positive). The horizontal axis shows the PDB index, and the vertical axis shows the affinity score S.

Abbreviation: PDB, Protein Data Bank.

Figure 4

Matrix of affinity score S with weighting parameters w = 0, and w = 1. The points on the diagonal line show the biologically relevant pairs, and the circles show the biologically relevant pairs where the AEP system provided a successful prediction.

Abbreviation: AEP, affinity evaluation and prediction.

Figure 3 shows the S of the receptor of each ligand, and the vertical axis shows the S and the horizontal axis shows the PDB index of the receptor. When our AEP system is used as a prediction system, the horizontal dotted line shows that the cutoff value at which the system achieves its maximum prediction performance is 7.54σ. This value provides the optimal tradeoff between sensitivity and specificity. Each filled arrowhead (▾) denotes a protein pair where S exceeds the chosen cutoff value, thus indicating that the biologically relevant pair with a statistically significant score; each unfilled arrowhead is a protein pair incorrectly identified as being biologically relevant (ie, false positive). Figure 4 shows the S matrix in which the 10 data points (denoted by circles) along the diagonal of the matrix represent the protein pairs that have been experimentally confirmed to be biologically relevant. The vertical axis shows the index of the ligand, and the horizontal axis that of the receptor. Based on Figures 2 and 3, the S obtained by our grouping process is more suitable than the simple S for estimating protein–protein affinity. For example, the S for a specific ligand is the highest for the following receptors: 1BPL44(1), 1F6045(7), 1GO346(8), 1KA947(9), 1RKE48(10), 1VET(11), 1Y9649(15), 2G2U50(17), and 2NXN(19). These protein pairs with the highest score are also biologically relevant pairs. Moreover, for 1F2T51(6), the S of a biologically relevant pair is the 2nd high score. Our AEP system can therefore constitute a prediction system if we introduce a recall, a precision and a cutoff value. Our system predicts pairs whose S exceeds the cutoff value and are thus biologically relevant pairs. Based on the ROC curve in Figure 5, a cutoff value of 7.54σ demonstrates the maximal prediction performance of our system. When this cutoff is adopted, the performance of our AEP as a prediction system has an F-measure of 0.526, a recall of 50.0%, a precision of 55.6% (sensitivity of 50.0% and a specificity of 97.9%), and an accuracy of 95.5%. When we consider that the prevalence of the current sample size is 5%, this performance is sufficiently high. Our AEP system identified 10 of 20 biologically relevant combinations from among 400 pairs of combination candidates. In addition, because our AEP system can identify the true negative pairs, the specificity and the accuracy are sufficiently high at 97.9%, 95.5%, respectively. This means that the prediction provided by our AEP system is reliable.

Figure 5

The ROC curve for the prediction performance of the AEP system. The peak performance achieved at point P (cutoff value is 7.54σ), and their values are shown in the bottom-right corner.

Abbreviations: AEP, affinity evaluation and prediction; ROC, receiver operating characteristics.

Efficacy of “asymmetrically biased” PSC score function for affinity analysis results by comparison of molecular size

We compiled the relationship between the number of atoms in terms of a molecular size and the results of the affinity analysis as shown in Figure 6. There is a relation between the affinity results and the size of receptors or ligands. This size indicates the number of atoms for a protein. The correlation coefficient between the identification probability of biologically relevant pairs and the size of receptor is 0.40. This positive correlation is related to the characteristic of the score function of the shape complementarity. As mentioned above, the score function we used for the receptor is different from that for the ligand. Because the score function for a receptor has a “bias” that tends to pull into the ligand to own pocket structure, we expect our AEP system to exhibit increasing docking structure accuracy when searching between a “large” receptor and a “small” ligand. The current results reveal the efficacy as regards the evaluation of affinity. Moreover, because this characteristic is always good for a molecule that consists of a large number of atoms, to make the estimation performance more accurate, the dataset should be rearranged in terms of the assignment of either a receptor or a ligand by the order of the molecular size. In other word, a “large” molecule should always be assigned as a receptor regardless of whether it functions biologically as a receptor or a ligand.

Figure 6

Relation between the size of each protein pair (ie, number of atoms) needed for biological relevance and the affinity analysis results. The horizontal axis shows the number of atoms in the receptor, and the vertical axis shows that in the ligand. The symbols denote the protein pair. The unfilled circles (○) denote a protein pair that was identified as biologically relevant by the AEP system, and the filled circles (●) denotes a protein pair identified as not biologically relevant.

Abbreviation: AEP, affinity evaluation and prediction.

Problem with shape complementarity evaluation

In the current affinity analysis, our AEP system succeeded in predicting the following 10 biologically relevant protein pairs from 400 (ie, 20 × 20) protein pairs: 1BPL(1), 1F2T(6), 1F60(7), 1GO3(8), 1KA9(9), 1RKE(10), 1VET(11), 1Y69(15), 2G2U(17), and 2NXN(19). The common feature of those complex structures is that, except for 1F2T(6) and 1Y96(15), they have a typical docking type that we called “surface-to-surface”. The ”surface-to-surface” type as shown in Figure 7(a) is docking formed under a wide contact surface. Even the affinity evaluation was derived from a shape complementarity search, because in practice, the affinity analysis revealed that a shape complementarity search can identify a pair with such a docking type in a sophisticated manner from 400 protein pair candidates.

Figure 7

Models for docking styles. Type (a), called “surface-to-surface”, is most popular docking among our docking complex structures. Type (b) is called “edged-to-edge” contacting only using each edge of receptor and ligand. Type (c) is called “point-to-point” which is most less performance by shape complementarity search.

In addition, the docking type of 1F2T(6) or 1Y96(15) is different from the typical type as mentioned above. We call their docking type “edge-to-edge”. The “edge-to-edge” type is a docking style where the edge of the surface of a receptor comes in contact with a ligand edge as shown in Figure 7(b). Naturally, a shape complementarity search has difficulty in dealing with the “edge-to-edge” docking type. Indeed as shown in Figure 3, with 1F2T(6), even our AEP system succeeded in predicting the pair, and the affinity score from the correct pair is the 2nd highest score rather than the highest. Our AEP system also succeeded for 1Y96(15), however, the 2nd highest score from an irrelevant pair is also over the cutoff value of 7.54σ. On the other hand, our AEP system failed to predict the following 10 biologically relevant pairs: 1AY7(0), 1CAU(2), 1D4V(3), 1EAY(4), 1EM8(5), 1WHM(12), 1WRD(13), 1XG2(14), 2F9Z(16), and 2HSN(18). This prediction failure might be due to a structural feature of the protein pair that hinders the shape complementarity evaluation. Because the docking type of 1AY7(0), 1CAU(2), 1EAY(4), 1WHM(12), 1WRD(13), 1XG2(14), and 2F9Z(16) is “edge-to-edge”, the affinity scores S are low (ie, under 7.54σ) for almost all the ligands. The complex structures of 1D4V(3), 1EM8(5), and 2HSN(18) are “point-to-point” docking type. This docking type has a smaller contact surface area than the “edge-to-edge” type as shown in Figure 7(c). In fact, the docking site of the 1EM8(5) receptor is too small to allow us to detect the correct pair. This makes a shape complementarity search difficult to perform with this type. Most of the protein pairs that our AEP system failed to predict were not the typical docking type (ie, surface-to-surface) with a complex formation as mentioned above but edge-to-edge and/or point-to-point type, because the shape complementarity search cannot deal easily with these types. We are now extracting the common characteristics from these negative docking types and developing methods to overcome these issues.

Conclusion

A shape complementarity search method and a statistical method called ‘grouping’ were combined to evaluate and predict the affinity between various protein pairs with known structures. At a prevalence of 5.0%, prediction with our AEP system had a recall of 50.0%, a precision of 55.6% and an accuracy of 95.5%. An evaluation of the affinity was difficult when only the PSC score based on the shape complementarity evaluation was considered. The affinity evaluation was significantly improved by the grouping process that we introduced. Future research will be undertaken on the relation between the physicochemical characteristic of proteins and the prediction accuracy of our AEP system, and then, based on those findings, research will be undertaken on further improving the accuracy of the system. Although an unbound dataset was not considered in this study, our future research will also include how to improve the prediction accuracy of an unbound dataset.

48 in total

1. Structure of the TRAIL-DR5 complex reveals mechanisms conferring specificity in apoptotic initiation.

Authors: J Mongkolsapaya; J M Grimes; N Chen; X N Xu; D I Stuart; E Y Jones; G R Screaton
Journal: Nat Struct Biol Date: 1999-11

2. ZDOCK: an initial-stage protein-docking algorithm.

Authors: Rong Chen; Li Li; Zhiping Weng
Journal: Proteins Date: 2003-07-01

3. PSIbase: a database of Protein Structural Interactome map (PSIMAP).

Authors: Sungsam Gong; Giseok Yoon; Insoo Jang; Dan Bolser; Panos Dafas; Michael Schroeder; Hansol Choi; Yoobok Cho; Kyungsook Han; Sunghoon Lee; Hwanho Choi; Michael Lappe; Liisa Holm; Sangsoo Kim; Donghoon Oh; Jonghwa Bhak
Journal: Bioinformatics Date: 2005-03-03 Impact factor: 6.937

4. Structural and computational characterization of the SHV-1 beta-lactamase-beta-lactamase inhibitor protein interface.

Authors: Kimberly A Reynolds; Jodi M Thomson; Kevin D Corbett; Christopher R Bethel; James M Berger; Jack F Kirsch; Robert A Bonomo; Tracy M Handel
Journal: J Biol Chem Date: 2006-06-29 Impact factor: 5.157

5. Recognition of RNase Sa by the inhibitor barstar: structure of the complex at 1.7 A resolution.

Authors: J Sevcík; L Urbanikova; Z Dauter; K S Wilson
Journal: Acta Crystallogr D Biol Crystallogr Date: 1998-09-01

6. Efficiency of signalling through cytokine receptors depends critically on receptor orientation.

Authors: R S Syed; S W Reid; C Li; J C Cheetham; K H Aoki; B Liu; H Zhan; T D Osslund; A J Chirino; J Zhang; J Finer-Moore; S Elliott; K Sitney; B A Katz; D J Matthews; J J Wendoloski; J Egrie; R M Stroud
Journal: Nature Date: 1998-10-01 Impact factor: 49.962

7. Structure of a cell polarity regulator, a complex between atypical PKC and Par6 PB1 domains.

Authors: Yoshinori Hirano; Sosuke Yoshinaga; Ryu Takeya; Nobuo N Suzuki; Masataka Horiuchi; Motoyuki Kohjima; Hideki Sumimoto; Fuyuhiko Inagaki
Journal: J Biol Chem Date: 2004-12-07 Impact factor: 5.157

8. Structural basis for the interaction between pectin methylesterase and a specific inhibitor protein.

Authors: Adele Di Matteo; Alfonso Giovane; Alessandro Raiola; Laura Camardella; Daniele Bonivento; Giulia De Lorenzo; Felice Cervone; Daniela Bellincampi; Demetrius Tsernoglou
Journal: Plant Cell Date: 2005-02-18 Impact factor: 11.277

Review 9. Principles of protein-protein interactions.

Authors: S Jones; J M Thornton
Journal: Proc Natl Acad Sci U S A Date: 1996-01-09 Impact factor: 11.205

10. A protein interaction map of the mitotic spindle.

Authors: Jonathan Wong; Yuko Nakajima; Stefan Westermann; Ching Shang; Jung-Seog Kang; Crystal Goodner; Pantea Houshmand; Stanley Fields; Clarence S M Chan; David Drubin; Georjana Barnes; Tony Hazbun
Journal: Mol Biol Cell Date: 2007-07-18 Impact factor: 4.138