| Literature DB >> 29538331 |
Hongjian Li, Jiangjun Peng, Yee Leung, Kwong-Sak Leung, Man-Hon Wong, Gang Lu, Pedro J Ballester.
Abstract
It has recently been claimed that the outstanding performance of machine-learning scoring functions (SFs) is exclusively due to the presence of training complexes with proteins highly similar to those in the test set. Here, we revisit this question using 24 similarity-based training sets, a widely used test set, and four SFs. Three of these SFs employ machine learning instead of the classical linear regression approach of the fourth SF (X-Score, which has the best test set performance out of 16 classical SFs). We found that the random forest (RF)-based RF-Score-v3 outperforms X-Score even when 68% of the most similar proteins are removed from the training set. In addition, unlike X-Score, RF-Score-v3 is able to keep learning with increasing training set size, becoming substantially more predictive than X-Score when the full set of 1105 complexes is used for training. These results show that machine-learning SFs owe a substantial part of their performance to training on complexes whose proteins are dissimilar to those in the test set, contrary to what had previously been concluded from the same data. Given that a growing amount of structural and interaction data will be available from academic and industrial sources, this performance gap between machine-learning SFs and classical SFs is expected to widen in the future.
Keywords: binding affinity prediction; machine learning; molecular docking; scoring function
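As a concrete illustration of the RF-versus-linear-regression comparison the abstract describes, here is a minimal sketch (not the authors' code) that trains both model types on precomputed protein–ligand features and evaluates them by test set Pearson correlation. The random feature matrices and the feature dimension are placeholder assumptions standing in for real descriptors.

```python
# A minimal sketch (not the authors' code) comparing a random-forest SF
# with a multiple-linear-regression SF on precomputed complex features.
# The random matrices below are placeholders for real descriptors.
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train, y_train = rng.random((1105, 36)), rng.random(1105)  # 1105 training complexes (placeholder features)
X_test, y_test = rng.random((195, 36)), rng.random(195)      # 195-complex diverse test set

for name, model in [("RF", RandomForestRegressor(n_estimators=500, random_state=0)),
                    ("MLR", LinearRegression())]:
    model.fit(X_train, y_train)
    rp, _ = pearsonr(y_test, model.predict(X_test))
    print(f"{name}: Rp = {rp:.3f}")  # Pearson correlation, the paper's evaluation metric
```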
Year: 2018 PMID: 29538331 PMCID: PMC5871981 DOI: 10.3390/biom8010012
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Figure 1. Test set performance of scoring functions (SFs) trained with nested datasets at different protein structural similarity cutoffs. Each of the four SFs (see legend at the bottom right) was trained with 13 nested datasets (i.e., a larger dataset includes all the complexes from the smaller datasets), leading to 13 implementations of each SF. This method of generating datasets was introduced by Li and Yang [9] to include training complexes with proteins increasingly similar to those in the test set (the smaller the dataset, the lower the applied protein structural similarity cutoff). To look deeper into these questions, we incorporated two smaller training sets and also included two additional machine-learning SFs (RF::X-Score and RF-Score-v3) in the comparison; a sketch of the protocol follows this caption. For each SF implementation, performance was calculated as the Pearson correlation between the predicted and measured binding affinities for the 195 diverse protein–ligand complexes in the test set. X-Score's performance levels off with as few as 116 training complexes, so it is unable to exploit the most similar complexes. By contrast, RF-Score-v3 keeps learning, outperforming X-Score with training sets larger than just 371 complexes. Note the large performance gap between machine-learning SFs and X-Score when all 1105 complexes were used for training. Abbreviations: RF, random forest; MLR, multiple linear regression.
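A minimal sketch of the nested-training-set protocol this caption describes, assuming each training complex's maximum similarity to any test set protein has already been computed; all names (max_sim_to_test, make_model, cutoffs) are illustrative, not from the paper.

```python
# Sketch of the nested-training-set evaluation: ascending similarity cutoffs
# yield nested training supersets, each used to fit one SF implementation.
# max_sim_to_test is a hypothetical precomputed per-complex similarity array.
import numpy as np
from scipy.stats import pearsonr

def nested_performance(X, y, max_sim_to_test, X_test, y_test, cutoffs, make_model):
    results = []
    for c in sorted(cutoffs):                       # ascending cutoffs => nested supersets
        mask = max_sim_to_test <= c                 # drop complexes too similar to the test set
        model = make_model().fit(X[mask], y[mask])
        rp, _ = pearsonr(y_test, model.predict(X_test))
        results.append((c, int(mask.sum()), rp))    # (cutoff, training set size, Pearson Rp)
    return results
```

Calling this with make_model set to, e.g., a random forest versus a linear regressor would produce learning curves of the kind plotted in the figure.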
Figure 2. Test set performance of SFs trained with nested datasets at different protein sequence similarity cutoffs. (a) Instead of protein structural similarity (Figure 1), sequence similarity cutoffs were employed here to build 11 nested training sets, including two training sets smaller than those previously used in [9]. In addition, we also included two additional machine-learning SFs (RF::X-Score and RF-Score-v3) in the comparison. Each of the four SFs was trained with these nested datasets, leading to 11 implementations of each SF. Each of the resulting 44 SF implementations was tested against the same 195 diverse protein–ligand complexes in the test set. Analogous conclusions were reached in this case: X-Score cannot benefit from more than 181 training samples, whereas its RF variant (denoted RF::X-Score) as well as RF-Score and RF-Score-v3 keep learning, increasing the correlation of their predicted binding affinities with the experimentally measured values and ultimately surpassing X-Score. (b) This plot reproduces the relevant part of the graphical abstract of Li and Yang's paper. Note that, for a training set complex to be labeled as similar, it was enough that a single test set complex had an above-cutoff similarity to it. Because the test set was generated by picking three representatives from each of the 65 sequence-based clusters, the vast majority of the remaining test complexes would tend to contain a protein dissimilar to that of the similar training complex.
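The labeling rule in panel (b) could be sketched as follows, assuming a hypothetical precomputed train-by-test sequence-identity matrix; this is an illustration of the rule, not the authors' implementation.

```python
# Sketch of the similarity-labeling rule from panel (b): a training complex
# counts as "similar" if ANY single test complex exceeds the cutoff.
# pairwise_seq_sim is a hypothetical (n_train, n_test) identity matrix.
import numpy as np

def label_similar(pairwise_seq_sim: np.ndarray, cutoff: float) -> np.ndarray:
    # True for each training complex with at least one above-cutoff test protein
    return (pairwise_seq_sim > cutoff).any(axis=1)
```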