
A simple spatial extension to the extended connectivity interaction features for binding affinity prediction.

Oghenejokpeme I Orhobor (1), Abbi Abdel Rehim (1), Hang Lou (2), Hao Ni (2,3), Ross D King (1,4,3).

Abstract

The representation of protein-ligand complexes used in building machine learning models plays an important role in the accuracy of binding affinity prediction. The Extended Connectivity Interaction Features (ECIF) scheme is one such representation. We report that (i) including discretized distances between protein-ligand atom pairs in the ECIF scheme improves predictive accuracy, and (ii) in an evaluation using gradient boosted trees, the resampling method used to select the best hyperparameters has a strong effect on predictive performance, especially for benchmarking purposes.
© 2022 The Authors.


Keywords:  machine learning; protein binding affinity prediction; scoring functions

Year:  2022        PMID: 35573039      PMCID: PMC9066299          DOI: 10.1098/rsos.211745

Source DB:  PubMed          Journal:  R Soc Open Sci        ISSN: 2054-5703            Impact factor:   3.653


Background

It is commonplace to estimate the binding affinity of protein-ligand complexes using in silico scoring functions (models) built using statistical machine learning (ML) algorithms [1,2]. ML model performance on any predictive task largely depends on the quality of the descriptors used in building it. Therefore, one can argue that the performance of a scoring function is predicated on two key components: (i) the choice of ML algorithm and (ii) the quality of the input descriptors. Among the many ML algorithms that have been used to construct scoring functions [3-6], gradient boosted trees (GBTs) have been identified as one of the top performers [4,7,8]. Like most sophisticated ML algorithms, GBTs have several hyperparameters that require tuning to achieve the best performance on a given problem. Selecting the best hyperparameters usually involves a search through the hyperparameter space, with candidate settings evaluated using some form of resampling [9]. While several search strategies for hyperparameter selection have been proposed [10], the search strategy is not the focus of this work. Here, we focus on the effect of the resampling technique used to identify optimal hyperparameters. We demonstrate empirically that the choice of resampling technique has a strong effect on the choice of optimal hyperparameters, and by extension, on the performance of the constructed scoring function. Although this is well known in the ML community, it is often a footnote or ignored in the protein-binding affinity prediction literature.

Several descriptors for representing protein-ligand complexes in ML-based scoring functions have been proposed [8,11]. These descriptors often implement principles from extended connectivity fingerprints [12].
A recent approach shown to achieve state-of-the-art performance is the extended connectivity interaction features (ECIF) [8], which identifies 22 atom-types for proteins and 70 atom-types for ligands based on connectivity. Pairs of these protein-ligand atom-types within a distance threshold are used as descriptors, where the value for each protein-ligand complex is the frequency of the pair's occurrence. In this work, we extend this approach by distinguishing between shorter and longer descriptor interactions, and refer to it as pair distance ECIF (PDECIF). We evaluated the proposed approach using the comparative assessment of scoring functions (CASF) family of benchmark datasets [5,13,14], and demonstrate that PDECIF outperforms ECIF when paired with GBTs. Our contributions are as follows. First, we demonstrate the effect that the choice of resampling technique when optimizing ML algorithm hyperparameters has on scoring function performance; although our analysis shows that the resulting differences in predictive performance are not statistically significant, they are nevertheless important, especially when comparing the performance of different scoring functions. Second, we propose PDECIF, an extension to ECIF which outperforms its predecessor.

Methods

ECIF and PDECIF

A ligand descriptor in the ECIF framework consists of an atom's symbol, explicit valence, number of attached heavy atoms, number of attached hydrogens, aromaticity and ring membership. Each of these properties can be represented textually, with the properties separated by semicolons, for example C;4;3;0;0;0. Protein descriptors are formulated in the same way, where Protein Data Bank (PDB) residue and atom pairs map to an ECIF atom of the same form; for example, ASN-OXT maps to O;2;1;0;0;0. Given a protein-ligand complex and a prespecified distance threshold, for example 6 Å, a valid ECIF descriptor consists of a protein-ligand atom pair of the form C;4;3;0;0;0-O;2;1;0;0;0. The assigned numerical value is the number of times the pair occurs in the complex within the distance threshold. In total there are 1540 such pairs; see the original work for the complete set [8]. In contrast to ECIF, rather than specifying a maximum distance, we specify a distance below which an interaction is considered short and above which it is considered long; the distance for long-range interactions is uncapped. Using the ECIF descriptor example above, the short- and long-range descriptors are C;4;3;0;0;0-O;2;1;0;0;0-l and C;4;3;0;0;0-O;2;1;0;0;0-h, respectively. Note that '-l' and '-h' serve simply as delineating suffixes for the two distance classes and have no intrinsic meaning.
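The counting scheme above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' code: it assumes the ECIF atom typing has already been applied, and the protein-before-ligand ordering of the pair string is an assumption made here for illustration.

```python
from itertools import product
from math import dist

def pdecif_counts(protein_atoms, ligand_atoms, threshold=6.0):
    """Count PDECIF descriptors for one protein-ligand complex.

    protein_atoms / ligand_atoms: lists of (ecif_type, (x, y, z)) tuples,
    where ecif_type is the textual ECIF atom type, e.g. 'C;4;3;0;0;0'.
    Pairs closer than `threshold` get the '-l' (short-range) suffix;
    all other pairs get the '-h' (long-range, uncapped) suffix.
    """
    counts = {}
    for (p_type, p_xyz), (l_type, l_xyz) in product(protein_atoms, ligand_atoms):
        suffix = "-l" if dist(p_xyz, l_xyz) < threshold else "-h"
        key = f"{p_type}-{l_type}{suffix}"
        counts[key] = counts.get(key, 0) + 1
    return counts

# Toy example: one protein atom, two ligand atoms of the same ECIF type.
protein = [("O;2;1;0;0;0", (0.0, 0.0, 0.0))]
ligand = [("C;4;3;0;0;0", (3.0, 0.0, 0.0)),   # 3 A away -> short range
          ("C;4;3;0;0;0", (9.0, 0.0, 0.0))]   # 9 A away -> long range
print(pdecif_counts(protein, ligand))
# {'O;2;1;0;0;0-C;4;3;0;0;0-l': 1, 'O;2;1;0;0;0-C;4;3;0;0;0-h': 1}
```

Dropping the suffix logic and discarding pairs beyond the threshold recovers plain ECIF counting.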

Datasets

We performed our evaluation using the CASF 2007, 2013, 2016 and 2019 benchmark datasets [5,13,14]. They all have independent train and test sets, with 1090–210, 2764–195, 3772–285 and 9291–285 train–test samples for the aforementioned datasets, respectively. We used these existing splits in our experiments. Note that the CASF 2019 set shares the same test set as the CASF 2016 set, albeit with more training samples. We read the raw data from PDBbind; where RDKit was incapable of reading the ligand files, Open Babel (v. 3.1.1) was used to convert them into mol files that were fully compatible with RDKit. All files were then saved in mol format before use. We considered four distance thresholds for both ECIF and PDECIF: 4 Å, 6 Å, 8 Å and 10 Å. Table 1 shows the number of features generated using ECIF and PDECIF at the different distance thresholds for the benchmark datasets. Note that the number of features for the ECIF datasets never reaches the 1540 reported in the original work, because we only included features for which at least one complex has a non-zero value. We also considered a case where the ECIF and PDECIF datasets are augmented with 194 ligand features [15] generated using RDKit. These features are described in electronic supplementary material, S1.
Table 1

The number of features generated for each of the benchmark datasets using the ECIF and PDECIF approaches at different distances (angstroms).

benchmark    distance (Å)    ECIF    PDECIF
CASF 2007    4                856     2178
             6               1161     2482
             8               1244     2563
             10              1285     2595
CASF 2013    4                996     2290
             6               1226     2520
             8               1268     2561
             10              1288     2579
CASF 2016    4               1078     2485
             6               1332     2739
             8               1376     2781
             10              1399     2803
CASF 2019    4               1176     2584
             6               1362     2770
             8               1389     2795
             10              1402     2807

Evaluation set-up

We used GBTs as our learner of choice. We optimized the hyperparameters using a grid search, keeping all but the number of rounds, maximum depth and learning rate at their default values, where the number of rounds = (500, 1000, 1500, 2000), max depth = (2, 4, 6, 8) and learning rate = (0.001, 0.01, 0.1, 0.2, 0.3). Selection of the best combination of these parameters was performed using only the training data for each of the benchmarks. We considered three resampling approaches: a train–test split (70–30%) and cross-validation (CV) with k = (5, 10). These were performed once and without repetition. Having identified the best performing hyperparameters and built the final model, we report the Pearson correlation coefficient (R) and root mean square error (RMSE) on each benchmark's test set. We performed feature selection on the overall best performing benchmark dataset and representation method pair using the Boruta algorithm [16] with a maximum of 500 runs, a p-value of 0.01 and a random forests [17] backend. The code used to generate the datasets and perform the experiments is available at https://github.com/oghenejokpeme/PDECIF; links to all datasets used are also provided in that repository.
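The grid-search-with-resampling procedure above can be sketched as follows. This is a backend-agnostic, stdlib-only illustration under assumed names (`train_and_score`, `grid_search_cv`): the scoring callback would wrap an actual GBT implementation in practice.

```python
from itertools import product
from statistics import mean

# The grid from the text: number of rounds, max depth, learning rate.
GRID = {
    "n_rounds": (500, 1000, 1500, 2000),
    "max_depth": (2, 4, 6, 8),
    "learning_rate": (0.001, 0.01, 0.1, 0.2, 0.3),
}

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous validation folds."""
    fold = n_samples // k
    return [list(range(i * fold, (i + 1) * fold if i < k - 1 else n_samples))
            for i in range(k)]

def grid_search_cv(train_and_score, n_samples, k=5):
    """Return the hyperparameter combination with the best mean CV score.

    `train_and_score(params, valid_idx)` is assumed to fit a model on the
    training rows outside `valid_idx` and return a score (e.g. Pearson R)
    on the held-out rows; any GBT backend can be plugged in here.
    """
    best_params, best_score = None, float("-inf")
    for combo in product(*GRID.values()):
        params = dict(zip(GRID.keys(), combo))
        score = mean(train_and_score(params, idx)
                     for idx in k_fold_indices(n_samples, k))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Plugging a real GBT (e.g. xgboost's `XGBRegressor` with `n_estimators`, `max_depth`, `learning_rate`) into `train_and_score` reproduces the CV set-ups described above; the train–test variant corresponds to a single 70–30 split rather than k folds.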

Results and discussion

Comparison of ECIF and PDECIF

Our results show that PDECIF generally outperforms ECIF, irrespective of distance, the presence of ligand features or the resampling method used in tuning the predictive model. We also observed that the best performing representation for every benchmark year included the ligand features, albeit at different distance thresholds. Crucially, we wanted to know whether (i) the resampling method used in selecting the best set of hyperparameters for the predictive model affects performance, and (ii) there is a difference in predictive performance between the different representations. For the first question, paired t-tests indicate that the difference in performance is not statistically significant when the three resampling methods are paired up and compared across all benchmark years, distances and representations. For the second, however, at a significance level of 0.01, paired t-tests indicate a significant difference in performance when the different dataset representations are paired and compared across the different resampling methods (table 2).
Table 2

P-values from paired t-test statistical testing of the difference in predictive performance (R) between the considered representations across the different resampling methods.

representation pair                    train–test       CV5              CV10
ECIF vs ECIF + ligand                  9.369 × 10^-5    1.918 × 10^-4    1.502 × 10^-4
ECIF vs PDECIF                         2.133 × 10^-3    6.220 × 10^-3    6.419 × 10^-3
ECIF vs PDECIF + ligand                5.471 × 10^-5    1.271 × 10^-4    1.391 × 10^-4
ECIF + ligand vs PDECIF                5.364 × 10^-1    1.388 × 10^-1    1.188 × 10^-1
ECIF + ligand vs PDECIF + ligand       6.175 × 10^-3    5.564 × 10^-4    3.924 × 10^-3
PDECIF vs PDECIF + ligand              3.688 × 10^-5    6.243 × 10^-8    1.052 × 10^-6
On the CASF 2019 benchmark dataset, ECIF's best performance (R/RMSE) is 0.842/1.258 without ligand features and 0.858/1.208 with them, both at a distance of 10 Å. By contrast, PDECIF's best performance is 0.854/1.235 without ligand features (at 4 Å) and 0.862/1.204 with them (at 10 Å). It is worth noting that our results for ECIF differ from those reported by Sánchez-Cruz et al. [8], where ECIF's best performance is 0.857/1.193 without ligand features and 0.866/1.169 with them; in their work, the authors refer to the benchmark dataset we call CASF 2019 as CASF-2016. Our results are more conservative. We believe this is because, although the same raw files were retrieved from PDBbind, the preprocessing steps we used to generate the mol files, and hence the representations, differ. However, our results confirm the following findings from their prior work: (i) including ligand features significantly improves predictive performance, and (ii) ECIF, and by extension PDECIF, outperforms other state-of-the-art approaches, such as the convolutional neural network (CNN) architectures KDEEP [3] (0.82/1.27) and TopBP-DL [18] (0.848/1.210). Furthermore, our proposed approach outperforms other state-of-the-art GBT-based scoring functions such as AGL-Score [4] and EIC-Score [7], with Pearson R coefficients of 0.833 and 0.828, respectively, on the CASF-2016 benchmark. See tables 3 and 4 for our complete set of results.
Table 3

Predictive performance (R/RMSE) for the ECIF and PDECIF representations with and without the ligand features for the CASF 2007 and 2013 benchmark datasets when the hyperparameters for the predictive model are selected using the train–test and cross-validation (k = {5, 10}) resampling methods. For each benchmark year and distance pair, the best performing representation (with and without ligand features) and resampling method is in italics. The overall best performing combination for the given benchmark dataset is in boldface.

year—distance    representation      train–test     CV5            CV10
CASF 2007—4      ECIF                0.739/1.663    0.736/1.665    0.729/1.692
                 PDECIF              0.811/1.458    0.807/1.468    0.802/1.482
                 ECIF + ligand       0.759/1.583    0.787/1.562    0.783/1.562
                 PDECIF + ligand     0.811/1.467    0.817/1.468    0.812/1.471
CASF 2007—6      ECIF                0.812/1.472    0.808/1.498    0.821/1.450
                 PDECIF              0.803/1.494    0.808/1.467    0.806/1.472
                 ECIF + ligand       0.814/1.459    0.820/1.455    0.812/1.460
                 PDECIF + ligand     0.823/1.430    0.826/1.428    0.817/1.446
CASF 2007—8      ECIF                0.805/1.468    0.813/1.449    0.815/1.446
                 PDECIF              0.816/1.455    0.812/1.450    0.811/1.460
                 ECIF + ligand       0.815/1.472    0.818/1.443    0.820/1.442
                 PDECIF + ligand     0.827/1.418    0.825/1.418    0.828/1.408
CASF 2007—10     ECIF                0.811/1.473    0.820/1.429    0.811/1.448
                 PDECIF              0.811/1.476    0.808/1.476    0.807/1.481
                 ECIF + ligand       0.802/1.496    0.820/1.461    0.817/1.438
                 PDECIF + ligand     0.814/1.486    0.822/1.444    0.818/1.440
CASF 2013—4      ECIF                0.708/1.629    0.694/1.655    0.717/1.613
                 PDECIF              0.762/1.522    0.773/1.499    0.779/1.490
                 ECIF + ligand       0.777/1.484    0.776/1.480    0.778/1.481
                 PDECIF + ligand     0.800/1.432    0.798/1.431    0.801/1.429
CASF 2013—6      ECIF                0.772/1.484    0.779/1.475    0.774/1.478
                 PDECIF              0.792/1.449    0.786/1.461    0.783/1.467
                 ECIF + ligand       0.801/1.419    0.791/1.437    0.790/1.439
                 PDECIF + ligand     0.811/1.405    0.802/1.420    0.817/1.384
CASF 2013—8      ECIF                0.772/1.487    0.769/1.497    0.772/1.483
                 PDECIF              0.774/1.485    0.784/1.459    0.783/1.465
                 ECIF + ligand       0.799/1.420    0.797/1.423    0.799/1.422
                 PDECIF + ligand     0.800/1.420    0.804/1.410    0.806/1.402
CASF 2013—10     ECIF                0.781/1.464    0.779/1.469    0.786/1.458
                 PDECIF              0.780/1.469    0.775/1.478    0.778/1.472
                 ECIF + ligand       0.800/1.420    0.798/1.416    0.809/1.396
                 PDECIF + ligand     0.798/1.421    0.796/1.424    0.797/1.423
Table 4

Predictive performance (R/RMSE) for the ECIF and PDECIF representations with and without the ligand features for the CASF 2016 and 2019 benchmark datasets when the hyperparameters for the predictive model are selected using the train–test and cross-validation (k = {5, 10}) resampling methods. For each benchmark year and distance pair, the best performing representation (with and without ligand features) and resampling method is in italics. The overall best performing combination for the given benchmark dataset is in boldface.

year—distance    representation      train–test     CV5            CV10
CASF 2016—4      ECIF                0.752/1.497    0.752/1.495    0.748/1.501
                 PDECIF              0.816/1.334    0.823/1.317    0.818/1.329
                 ECIF + ligand       0.818/1.335    0.822/1.319    0.822/1.323
                 PDECIF + ligand     0.841/1.272    0.840/1.273    0.839/1.275
CASF 2016—6      ECIF                0.808/1.343    0.811/1.335    0.802/1.353
                 PDECIF              0.833/1.277    0.833/1.280    0.828/1.293
                 ECIF + ligand       0.840/1.263    0.840/1.260    0.829/1.284
                 PDECIF + ligand     0.843/1.252    0.840/1.258    0.844/1.248
CASF 2016—8      ECIF                0.806/1.343    0.804/1.350    0.797/1.361
                 PDECIF              0.823/1.303    0.829/1.290    0.824/1.305
                 ECIF + ligand       0.831/1.281    0.832/1.275    0.838/1.263
                 PDECIF + ligand     0.831/1.276    0.843/1.248    0.842/1.256
CASF 2016—10     ECIF                0.815/1.320    0.812/1.328    0.816/1.314
                 PDECIF              0.825/1.298    0.823/1.300    0.830/1.288
                 ECIF + ligand       0.844/1.245    0.842/1.252    0.842/1.256
                 PDECIF + ligand     0.842/1.252    0.844/1.246    0.839/1.260
CASF 2019—4      ECIF                0.793/1.424    0.795/1.417    0.791/1.426
                 PDECIF              0.854/1.235    0.853/1.239    0.851/1.249
                 ECIF + ligand       0.833/1.294    0.833/1.289    0.832/1.290
                 PDECIF + ligand     0.859/1.217    0.855/1.223    0.859/1.212
CASF 2019—6      ECIF                0.832/1.284    0.837/1.272    0.833/1.284
                 PDECIF              0.850/1.236    0.850/1.240    0.849/1.241
                 ECIF + ligand       0.847/1.237    0.848/1.230    0.853/1.223
                 PDECIF + ligand     0.859/1.208    0.860/1.201    0.860/1.204
CASF 2019—8      ECIF                0.831/1.290    0.836/1.281    0.839/1.268
                 PDECIF              0.845/1.251    0.848/1.244    0.849/1.239
                 ECIF + ligand       0.854/1.222    0.852/1.230    0.851/1.227
                 PDECIF + ligand     0.857/1.215    0.862/1.204    0.854/1.222
CASF 2019—10     ECIF                0.832/1.282    0.842/1.258    0.837/1.271
                 PDECIF              0.848/1.245    0.849/1.248    0.851/1.236
                 ECIF + ligand       0.856/1.211    0.858/1.208    0.854/1.217
                 PDECIF + ligand     0.855/1.215    0.859/1.207    0.857/1.203

Feature importance

We performed feature selection on the CASF 2019 benchmark training dataset with the PDECIF representation at a distance of 8 Å augmented with the ligand features, as this was the best performing combination (table 4). This dataset has 2912 features, of which the Boruta feature selection algorithm identified 817 as confirmed important. The top 20 of these features by mean importance, in descending order, are: C;4;3;1;0;0-C;4;3;0;1;1-h, N;3;2;1;0;0-C;4;3;0;1;1-h, O;2;1;0;0;0-C;4;3;0;1;1-h, C;4;3;0;0;0-C;4;3;0;1;1-h, Crippen_MolLogP, C;4;1;3;0;0-C;4;3;0;1;1-h, C;4;2;1;1;1-C;4;3;0;1;1-h, C;4;3;0;1;1-C;4;3;0;1;1-h, SlogP_VSA2, Chi2v, C;4;2;2;0;0-C;4;3;0;1;1-h, O;2;1;0;0;0-C;4;3;0;1;1-l, Chi3v, C;4;3;0;1;1-C;4;2;1;1;1-h, C;4;1;3;0;0-C;4;3;0;1;1-l, C;4;2;1;1;1-C;4;2;1;1;1-l, Chi1v, MaxAbsPartialCharge, Chi4v, C;4;2;1;1;1-C;4;3;0;1;1-l. The full set of important features is provided in the electronic supplementary material. It is worth noting that 121 of the 194 ligand features we considered were among the 817 important features. Having identified these features using the training set, we performed additional experiments using only them and the same tuning configuration discussed in the previous section. The performance (R/RMSE) for the train–test, CV5 and CV10 resampling methods is 0.852/1.224, 0.851/1.224 and 0.849/1.228, respectively. This means they achieve approximately 99.4%, 98.7% and 99.4% of the full-dataset performance (see row 'CASF 2019—8' and entry 'PDECIF + ligand' in table 4).
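The retention percentages quoted above follow directly from the R values in the text and in the 'CASF 2019—8', 'PDECIF + ligand' row of table 4, and can be checked with a few lines of arithmetic:

```python
# R after Boruta feature selection vs. R with the full feature set
# (CASF 2019, 8 A, PDECIF + ligand; values from the text and table 4).
pairs = {
    "train-test": (0.852, 0.857),
    "CV5":        (0.851, 0.862),
    "CV10":       (0.849, 0.854),
}
retained = {name: round(100 * selected / full, 1)
            for name, (selected, full) in pairs.items()}
print(retained)
# {'train-test': 99.4, 'CV5': 98.7, 'CV10': 99.4}
```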

Conclusion

In this paper, we have presented a simple extension to the ECIF representation approach for the in silico prediction of protein binding affinity. Our results show that the extension significantly outperforms the base approach on the CASF benchmark datasets. Furthermore, we show that for GBTs, the resampling method used in optimizing the hyperparameters affects predictive accuracy; although our experiments show that these differences are not statistically significant, they are particularly important when comparing against other scoring functions, where progress is measured by performance on benchmark datasets.
References (12 in total)

1.  Extended-connectivity fingerprints.

Authors:  David Rogers; Mathew Hahn
Journal:  J Chem Inf Model       Date:  2010-05-24       Impact factor: 4.956

2.  Comparative assessment of scoring functions on a diverse test set.

Authors:  Tiejun Cheng; Xun Li; Yan Li; Zhihai Liu; Renxiao Wang
Journal:  J Chem Inf Model       Date:  2009-04       Impact factor: 4.956

3.  Comparative assessment of scoring functions on an updated benchmark: 1. Compilation of the test set.

Authors:  Yan Li; Zhihai Liu; Jie Li; Li Han; Jie Liu; Zhixiong Zhao; Renxiao Wang
Journal:  J Chem Inf Model       Date:  2014-06-02       Impact factor: 4.956

4.  Extended connectivity interaction features: improving binding affinity prediction through chemical description.

Authors:  Norberto Sánchez-Cruz; José L Medina-Franco; Jordi Mestres; Xavier Barril
Journal:  Bioinformatics       Date:  2021-06-16       Impact factor: 6.937

5.  KDEEP: Protein-Ligand Absolute Binding Affinity Prediction via 3D-Convolutional Neural Networks.

Authors:  José Jiménez; Miha Škalič; Gerard Martínez-Rosell; Gianni De Fabritiis
Journal:  J Chem Inf Model       Date:  2018-01-29       Impact factor: 4.956

6.  DG-GL: Differential geometry-based geometric learning of molecular datasets.

Authors:  Duc Duy Nguyen; Guo-Wei Wei
Journal:  Int J Numer Method Biomed Eng       Date:  2019-02-07       Impact factor: 2.747

7.  AGL-Score: Algebraic Graph Learning Score for Protein-Ligand Binding Scoring, Ranking, Docking, and Screening.

Authors:  Duc Duy Nguyen; Guo-Wei Wei
Journal:  J Chem Inf Model       Date:  2019-07-01       Impact factor: 4.956

8.  Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening.

Authors:  Zixuan Cang; Lin Mu; Guo-Wei Wei
Journal:  PLoS Comput Biol       Date:  2018-01-08       Impact factor: 4.475

9.  OnionNet: a Multiple-Layer Intermolecular-Contact-Based Convolutional Neural Network for Protein-Ligand Binding Affinity Prediction.

Authors:  Liangzhen Zheng; Jingrong Fan; Yuguang Mu
Journal:  ACS Omega       Date:  2019-09-16

10.  Development of a protein-ligand extended connectivity (PLEC) fingerprint and its application for binding affinity predictions.

Authors:  Maciej Wójcikowski; Michał Kukiełka; Marta M Stepniewska-Dziubinska; Pawel Siedlecki
Journal:  Bioinformatics       Date:  2019-04-15       Impact factor: 6.937

