Lucia Fusani, Alvaro Cortes Cabrera.
Abstract
The COMBINE method was designed to study congeneric series of compounds by including structural information from ligand-protein complexes. Although very successful, the method has not received the same level of attention as other approaches to Quantitative Structure-Activity Relationships (QSAR), mainly because of the lack of ways to measure the uncertainty of its predictions and the need for large datasets. Active learning, a semi-supervised learning approach that uses uncertainty to enhance model performance while reducing the size of the training set, has been used in this work to address both problems. We propose two estimators of uncertainty: the pool of regressors and the distance to the training set. The performance of the methods was evaluated by testing the resulting active learning workflows on three diverse datasets: HIV-1 protease inhibitors, Taxol derivatives and BRD4 inhibitors. The proposed strategies were successful in 80% of the cases for the Taxol derivatives and BRD4 inhibitors, and they outperformed random selection in the case of the HIV-1 protease inhibitors time-split. Our results suggest that AL-COMBINE might be an effective way of producing consistently superior QSAR models with a limited number of samples.
Keywords: Active learning; BRD4; COMBINE; HIV; Protease; QSAR; Regression; Taxanes
Year: 2018 PMID: 30564994 PMCID: PMC7087723 DOI: 10.1007/s10822-018-0181-3
Source DB: PubMed Journal: J Comput Aided Mol Des ISSN: 0920-654X Impact factor: 3.686
Fig. 1 Diagram of the different Active Learning strategies employed in this work. a General workflow; b Pool of regressors; c Distance to the training set
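The two uncertainty estimators named in Fig. 1 can be illustrated with a toy sketch. The code below is an illustration under stated assumptions, not the authors' implementation: it uses a single hypothetical descriptor and an ordinary least-squares line in place of the COMBINE regression model, builds the pool of regressors from bootstrap resamples, and queries the candidate with the highest uncertainty at each iteration. The function names (`fit_line`, `pool_uncertainty`, `distance_uncertainty`, `active_learning`) and the `oracle` callback that stands in for an experimental measurement are all invented for this example.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b (stand-in for the real model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        # Degenerate sample (all x equal): fall back to a constant model.
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def pool_uncertainty(train, candidates, n_models=20, seed=0):
    """Assumed 'pool of regressors': spread of predictions across models
    fitted on bootstrap resamples of the training set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]
        models.append(fit_line([x for x, _ in sample], [y for _, y in sample]))
    return [statistics.pstdev([a * x + b for a, b in models]) for x in candidates]


def distance_uncertainty(train, candidates):
    """Assumed 'distance to the training set': distance from each candidate
    to its nearest training sample in descriptor space."""
    return [min(abs(x - xt) for xt, _ in train) for x in candidates]


def active_learning(pool, oracle, estimator, n_start=3, n_iter=5, seed=0):
    """Grow the training set by querying the most uncertain candidate."""
    rng = random.Random(seed)
    pool = list(pool)
    rng.shuffle(pool)
    train = [(x, oracle(x)) for x in pool[:n_start]]
    remaining = pool[n_start:]
    for _ in range(n_iter):
        if not remaining:
            break
        scores = estimator(train, remaining)
        pick = max(range(len(remaining)), key=scores.__getitem__)
        x = remaining.pop(pick)
        train.append((x, oracle(x)))  # "measure" the selected compound
    return train
```

Passing `pool_uncertainty` or `distance_uncertainty` as `estimator` switches between the two strategies. A random-selection baseline, as used for comparison in the figures, would replace the `max` call with a random index.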
Results of the full COMBINE HIV-PR model validation
| | HIV-PR from Perez et al. COMBINE AMBER model [ | HIV-PR inhibitors | Taxanesᵃ | BRD4-BD1 inhibitorsᵃ |
|---|---|---|---|---|
| r² | 0.89 | 0.85 | 0.94 | 0.81 |
| q² | 0.70 | 0.77 | 0.60 | 0.56 |
| SDEP (cv) | 0.72 | 0.63 | 0.91 | 0.36 |
| SDEP (ext) | 0.83 | 0.82 | – | – |
| r² (ext) | – | 0.78 | – | – |

ᵃ Averages over 20 repeats of an 80%/20% training/test split. HIV-PR values were obtained as described in the text.
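For readers unfamiliar with the statistics in the table above, the following sketch shows one common way to compute a cross-validated q² and the standard deviation of the error of prediction (SDEP). It assumes leave-one-out cross-validation and a one-descriptor least-squares line as a stand-in model; the source does not state the cross-validation scheme here, so both choices, and all function names, are assumptions made for illustration.

```python
import math
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b (illustrative stand-in model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot; gives r2 on fitted values, q2 on CV predictions."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def sdep(y_true, y_pred):
    """Root of the mean squared prediction error, sqrt(PRESS / N)."""
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(press / len(y_true))


def loo_predictions(xs, ys):
    """Predict each sample from a model fitted on all the other samples."""
    xs, ys = list(xs), list(ys)
    preds = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(a * xs[i] + b)
    return preds
```

Calling `r_squared(ys, loo_predictions(xs, ys))` gives q², and `sdep(ys, loo_predictions(xs, ys))` gives the cross-validated SDEP. Applying the same two functions to predictions on a held-out set would give the external counterparts.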
Fig. 2 Performance of the active learning strategies in the HIV-1 protease inhibitors set. a Mean squared error at each iteration for the pool of regressors strategy vs. random selection and the full model. b Coefficient of determination at each iteration for the pool of regressors strategy vs. random selection and the full model. c Mean squared error at each iteration for the distance to the training set strategy vs. random selection and the full model. d Coefficient of determination at each iteration for the distance to the training set strategy vs. random selection and the full model
Fig. 3 Evolution of the models for each iteration in the HIV-PR simulation. a Pool of regressors. b Random selection. Green lines represent y = x; red lines mark ± 0.5 units from the green line (y = x + 0.5 and y = x − 0.5). The resulting regression between the predicted values for the samples (orange dots) and the experimental values is plotted in blue
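The ± 0.5 band drawn in Fig. 3 suggests a simple numerical summary: the fraction of predictions that fall within 0.5 units of the experimental value. The helper below is a hypothetical illustration of that count; the function name and the default tolerance are assumptions taken from the caption, and the source does not report this statistic.

```python
def fraction_within_band(experimental, predicted, tol=0.5):
    """Share of samples whose prediction lies within +/- tol of y = x."""
    hits = sum(abs(p - e) <= tol for e, p in zip(experimental, predicted))
    return hits / len(experimental)
```

Tracking this fraction per iteration would give a one-number view of how quickly the points collapse toward the y = x line under each selection strategy.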