Lina Dong, Xiaoyang Qu, Binju Wang.
Abstract
Prediction of protein-ligand binding affinities is a central issue in structure-based computer-aided drug design. In recent years, much effort has been devoted to the prediction of the binding affinity in protein-ligand complexes using machine learning (ML). Due to the remarkable ability of ML methods in nonlinear fitting, ML-based scoring functions (SFs) can deliver much improved performance on a selected test set, such as the comparative assessment of scoring functions (CASF), when compared to the classical SFs. However, the performance of ML-based SFs heavily relies on the overall similarity of the training set and the test set. To improve the performance and transferability of an SF, we have tried to combine various features including energy terms from X-score and AutoDock Vina, the properties of ligands, and the statistical sequence-related information from either the binding site or the full protein. In conjunction with extreme trees (ET), an ML model, we have developed XLPFE, a new SF. Compared with other tested methods such as X-score, AutoDock Vina, ΔvinaXGB, PSH-ML, or CNN-score, XLPFE achieves consistently better scoring and ranking power for various types of protein-ligand complex structures beyond the CASF, suggesting that XLPFE has superior transferability. In particular, XLPFE performs better with metalloenzymes. With its faster speed, improved accuracy, and better transferability, XLPFE could be usefully applied to a diverse range of protein-ligand complexes.
Year: 2022 PMID: 35785279 PMCID: PMC9245135 DOI: 10.1021/acsomega.2c01723
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
Summary of the Data Sets
| data set | source | number of complexes |
|---|---|---|
| training set | PDBbind refined set (before 2018) | 4190 |
| test set 1 | PDBbind refined set (after 2018) | 394 |
| test set 2 | CASF-2016 | 285 |
Sample Size of the Nonredundant Set under Different Similarity Thresholds
| sequence similarity threshold (%) | training set size | test set size |
|---|---|---|
| 100 | 4190 | 285 |
| 95 | 3949 | 158 |
| 90 | 3390 | 57 |
| 85 | 2824 | 23 |
| 80 | 2318 | |
Summary of the Feature Sets
| feature set | terms | dimension |
|---|---|---|
| AutoDock Vina | 58 terms from the Vina source code | 58 |
| X-score | VDW, HB, RT, HS, HM, and HP | 6 |
| ligand | charge; C, N, O, H, F, P, S, Cl, Br, and I element numbers; and 1, 2, 3, am, and ar bond numbers | 16 |
| pocket (binding site) | 20 amino acid numbers and crystal H2O number | 21 |
| full protein | 20 amino acid numbers | 20 |
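
The feature blocks in the table above can be joined into a single input vector. The following is a minimal sketch, not the authors' code: it assumes that the XLPF set named in the figure captions is the concatenation of the X-score, ligand, binding-site, and full-protein blocks, and that the "amino acid numbers" are plain residue counts. The function names, the residue ordering, and the example inputs are all hypothetical.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (this ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_counts(sequence):
    """Return the 20 per-residue counts for a one-letter sequence."""
    counts = Counter(sequence.upper())
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def build_feature_vector(xscore_terms, ligand_terms, pocket_seq, n_waters, protein_seq):
    """Concatenate the four feature blocks into one flat vector."""
    if len(xscore_terms) != 6:
        raise ValueError("expected 6 X-score terms")
    if len(ligand_terms) != 16:
        raise ValueError("expected 16 ligand descriptors")
    pocket = amino_acid_counts(pocket_seq) + [n_waters]  # 20 counts + water count = 21
    protein = amino_acid_counts(protein_seq)             # 20 counts
    return list(xscore_terms) + list(ligand_terms) + pocket + protein


# Placeholder inputs for illustration only, not real data.
example = build_feature_vector(
    xscore_terms=[0.0] * 6,
    ligand_terms=[0] * 16,
    pocket_seq="HDSGAH",
    n_waters=3,
    protein_seq="MKTAYIAKQRHDSGAH",
)
print(len(example))  # 6 + 16 + 21 + 20 = 63
```

Under these assumptions the combined vector has 63 dimensions; swapping the X-score block for the 58 AutoDock Vina terms would change only the first block.
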
Figure 1. Pearson correlation coefficients between the experimental data and the predicted binding affinities on (A) test set 1 and (B) test set 2 (CASF-2016) for combinations of different feature sets and ML methods. The dimension of features increases from the top to the bottom. The darker the color (blue), the higher the correlation, and the lighter the color (yellow), the lower the correlation.
Figure 2. Pearson correlation coefficients between the experimental data and the predicted binding affinities for different sequence similarity thresholds of the training set using (A) the VLP feature set and (B) the XLPF feature set, and for different sequence similarity thresholds of the test set using (C) the VLP feature set and (D) the XLPF feature set. Different ML methods are shown in different colors: LR in blue, ET in orange, RF in green, NN in red, SVR in purple, and XGBoost in brown.
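
The Pearson correlation coefficient reported in Figures 1 and 2 measures the linear agreement between experimental and predicted affinities. A minimal pure-Python sketch is given here for reference; the function name and the sample values are illustrative placeholders and are not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up affinity values for illustration only.
experimental = [4.2, 6.1, 7.5, 5.0, 8.3]
predicted = [4.8, 5.9, 7.0, 5.6, 7.7]
r = pearson_r(experimental, predicted)
print(f"Pearson r = {r:.3f}")
```

A value near 1 indicates that predictions rise and fall with the experimental values; it does not by itself indicate that the absolute errors are small.
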
Figure 3. (A) Correlation matrix of features and experimental values. (B) Feature importance. Feature importance values are calculated based on the number of times a feature is used to split the data across all trees. Here, the eight most significant features are shown.
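
The split-count importance described in the caption of Figure 3 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the trees are represented by a hypothetical nested-tuple format, and the feature names and thresholds are invented for the example.

```python
from collections import Counter

# Toy tree format (an assumption for illustration): an internal node is a tuple
# (feature_name, threshold, left_subtree, right_subtree); a leaf is a number.


def count_splits(tree, counts):
    """Add one to counts[feature] for every internal node in the tree."""
    if isinstance(tree, tuple):
        feature, _threshold, left, right = tree
        counts[feature] += 1
        count_splits(left, counts)
        count_splits(right, counts)


def split_count_importance(forest, top_k=8):
    """Rank features by how often they are used to split, across all trees."""
    counts = Counter()
    for tree in forest:
        count_splits(tree, counts)
    return counts.most_common(top_k)


# Two hand-written toy trees with hypothetical feature names.
forest = [
    ("vdw", -3.0, ("hb", 1.5, 5.1, 6.4), ("n_carbon", 20, 6.9, 7.8)),
    ("hb", 2.0, 5.5, ("vdw", -5.0, 8.0, 6.7)),
]
ranking = split_count_importance(forest)
print(ranking)
```

Counting splits treats every split equally regardless of how much it reduces the prediction error, which is one reason such rankings can differ from impurity-based importances.
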
Performance of XLPFE, X-Score, AutoDock Vina, ΔvinaXGB, CNN-Score, PSH-ML, and Lin_F9 Evaluated against a Set Consisting of 15 Selected Diverse Biological Targets^a
^a BACE-1, β-secretase 1; CHK1, serine/threonine-protein kinase Chk1; DPP4, dipeptidyl peptidase 4; ER, estrogen receptor; GluR2, glutamate receptor 2; HIV PR, HIV-1 protease; HSP90, heat shock protein 90; LTA-4H, leukotriene A-4 hydrolase; P38a, mitogen-activated protein kinase 14; PDE4B, cAMP-specific 3′,5′-cyclic phosphodiesterase 4B; PDK1, 3-phosphoinositide-dependent protein kinase 1; PTP1B, protein tyrosine phosphatase 1B; and SRC, proto-oncogene tyrosine-protein kinase Src. The more intensely red a table cell, the smaller the value; the more intensely green, the larger the value.
Figure 4. Predicted affinities for complexes containing different types of metal atoms. The performance of XLPFE, PSH-ML, Lin_F9, and X-Score on a set of five kinds of selected metal-containing targets is listed in the table at the bottom of the figure.