Tapping on the Black Box: How Is the Scoring Power of a Machine-Learning Scoring Function Dependent on the Training Set?

Minyi Su, Guoqin Feng, Zhihai Liu, Yan Li, Renxiao Wang.

Abstract

In recent years, protein-ligand interaction scoring functions derived through machine learning have repeatedly been reported to outperform conventional scoring functions. However, several published studies have raised the concern that the superior performance of machine-learning scoring functions depends on the overlap between the training set and the test set. To examine the true power of machine-learning algorithms in scoring function formulation, we conducted a systematic study of six off-the-shelf machine-learning algorithms: Bayesian Ridge Regression (BRR), Decision Tree (DT), K-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), Linear Support Vector Regression (L-SVR), and Random Forest (RF). Model scoring functions were derived with these algorithms on various training sets selected from over 3700 protein-ligand complexes in the PDBbind refined set (version 2016). All resulting scoring functions were then applied to the CASF-2016 test set to validate their scoring power. In our first series of trials, the size of the training set was fixed while the overall similarity between the training set and the test set was varied systematically. In our second series of trials, the overall similarity between the training set and the test set was fixed while the size of the training set was varied. Our results indicate that the performance of these machine-learning models depends, to varying degrees, on the contents or the size of the training set, with the RF model demonstrating the best learning capability. In contrast, the performance of three conventional scoring functions (i.e., ChemScore, ASP, and X-Score) is essentially insensitive to the use of different training sets. Therefore, one has to consider not only "hard overlap" but also "soft overlap" between the training set and the test set when evaluating machine-learning scoring functions. In this spirit, we have compiled data sets based on the PDBbind refined set by removing redundant samples under several similarity thresholds. Scoring function developers are encouraged to employ them as standard training sets if they want to evaluate their new models on the CASF-2016 benchmark.
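The idea of removing training samples that are too similar to the test set can be illustrated with a small sketch. This is not the authors' procedure: the abstract does not define the similarity measure, so a Tanimoto score over made-up feature sets stands in for it here, and the sample names, the helper `remove_soft_overlap`, and the thresholds are all hypothetical. The `pearson_r` helper reflects the usual way scoring power is reported on CASF benchmarks, namely the Pearson correlation between predicted and experimental binding affinities.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets.

    Assumption: a stand-in for whatever complex-to-complex similarity
    measure a real study would use.
    """
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def remove_soft_overlap(train, test, threshold):
    """Keep only training samples whose similarity to every test sample
    is strictly below the threshold (hypothetical helper)."""
    return {
        name: feats
        for name, feats in train.items()
        if all(tanimoto(feats, t) < threshold for t in test.values())
    }


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy data (entirely made up): feature sets for three training samples
# and one test sample.
train = {
    "train_a": {"f1", "f2", "f3"},        # identical to test_x ("hard overlap")
    "train_b": {"f1", "f2", "f3", "f4"},  # similar to test_x ("soft overlap")
    "train_c": {"f7", "f8"},              # unrelated to test_x
}
test = {"test_x": {"f1", "f2", "f3"}}

hard_only = remove_soft_overlap(train, test, threshold=1.0)
strict = remove_soft_overlap(train, test, threshold=0.5)

print(sorted(hard_only))  # identical sample removed, similar one kept
print(sorted(strict))     # identical and similar samples removed
```

With a threshold of 1.0 only the identical sample is dropped, which corresponds to removing "hard overlap" alone; lowering the threshold also drops the merely similar sample, which is the "soft overlap" the abstract argues should be controlled as well.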

Year: 2020   PMID: 32085675   DOI: 10.1021/acs.jcim.9b00714

Source DB: PubMed   Journal: J Chem Inf Model   ISSN: 1549-9596   Impact factor: 4.956

