Tong He, Marten Heidemeyer, Fuqiang Ban, Artem Cherkasov, Martin Ester.
Abstract
Computational prediction of the interaction between drugs and targets is a standing challenge in the field of drug discovery. A number of rather accurate predictions were reported for various binary drug-target benchmark datasets. However, a notable drawback of a binary representation of interaction data is that missing endpoints for non-interacting drug-target pairs are not differentiated from inactive cases, and that predicted levels of activity depend on pre-defined binarization thresholds. In this paper, we present a method called SimBoost that predicts continuous (non-binary) values of binding affinities of compounds and proteins and thus incorporates the whole interaction spectrum from true negative to true positive interactions. Additionally, we propose a version of the method called SimBoostQuant which computes a prediction interval in order to assess the confidence of the predicted affinity, thus defining the Applicability Domain metrics explicitly. We evaluate SimBoost and SimBoostQuant on two established drug-target interaction benchmark datasets and one new dataset that we propose to use as a benchmark for read-across cheminformatics applications. We demonstrate that our methods outperform the previously reported models across the studied datasets.
Keywords: Applicability Domain; Drug–target interaction; Gradient boosting; Prediction interval; QSAR; Read-across
Year: 2017 PMID: 29086119 PMCID: PMC5395521 DOI: 10.1186/s13321-017-0209-z
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 The workflow of SimBoost and SimBoostQuant
Structure of feature vector for
| Type 1 of | Type 1 of | Type 2 of | Type 2 of | Type 3 of |
The statistics of the three datasets
| Dataset | Number of drugs | Number of targets | Density (%) |
|---|---|---|---|
| Davis | 68 | 442 | 100 |
| Metz | 1421 | 156 | 42.1 |
| KIBA | 2116 | 229 | 24.4 |
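As a rough illustration of the density column, the sketch below assumes that density means the percentage of drug-target pairs with a measured affinity. The function names `density` and `observed_pairs` are hypothetical, and any pair count derived from a rounded density is only approximate.

```python
def density(n_observed: int, n_drugs: int, n_targets: int) -> float:
    """Percentage of drug-target pairs that have a measured affinity.

    Assumption: density = observed pairs / all possible pairs.
    """
    total_pairs = n_drugs * n_targets
    if total_pairs == 0:
        raise ValueError("need at least one drug and one target")
    return 100.0 * n_observed / total_pairs


def observed_pairs(density_pct: float, n_drugs: int, n_targets: int) -> int:
    """Approximate number of measured pairs implied by a reported density."""
    return round(density_pct / 100.0 * n_drugs * n_targets)


# A fully observed 68 x 442 matrix has every pair measured.
full_matrix_pairs = observed_pairs(100, 68, 442)
print(full_matrix_pairs)  # 30056

# For a sparse matrix the count is an estimate, because the density is rounded.
sparse_estimate = observed_pairs(42.1, 1421, 156)
print(sparse_estimate)
```

Under this reading, lower density means that more of the interaction matrix has to be predicted rather than looked up.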
Fig. 2 Distribution of values in the three datasets (Davis, Metz and KIBA from left to right) and binarization thresholds (vertical red line)
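The dependence of binary labels on the chosen threshold can be shown with a minimal sketch. The affinity values and the two thresholds below are made up for illustration and are not taken from any of the three datasets; the helper `binarize` is hypothetical.

```python
def binarize(affinities, threshold, higher_is_active=True):
    """Turn continuous affinities into 0/1 interaction labels.

    Assumption: a pair counts as interacting when its value is on the
    active side of the threshold (>= by default, <= otherwise).
    """
    if higher_is_active:
        return [int(a >= threshold) for a in affinities]
    return [int(a <= threshold) for a in affinities]


# Made-up affinity values for five drug-target pairs.
values = [5.0, 6.8, 7.1, 7.6, 9.2]

print(binarize(values, 7.0))  # [0, 0, 1, 1, 1]
print(binarize(values, 7.5))  # [0, 0, 0, 1, 1]
```

The third pair changes class when the threshold moves from 7.0 to 7.5, which is the kind of threshold sensitivity that predicting continuous values avoids.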
Results on the Davis dataset, with the mean and standard deviation from 10 repetitions
| Method | RMSE | AUC | AUPR | CI |
|---|---|---|---|---|
| MF | 0.509 ± 0.010 | 0.876 ± 0.004 | 0.499 ± 0.017 | 0.816 ± 0.004 |
| Continuous | 0.608 ± 0.002 | 0.942 ± 0.001 | 0.679 ± 0.003 | 0.860 ± 0.001 |
| Binary | – | 0.931 ± 0.001 | 0.686 ± 0.006 | – |
| SimBoost | 0.36 ± 0.001 | 0.942 ± 0.002 | 0.680 ± 0.002 | 0.871 ± 0.004 |
Results on the Metz dataset, with the mean and standard deviation from 10 repetitions
| Method | RMSE | AUC | AUPR | CI |
|---|---|---|---|---|
| MF | 0.303 ± 0.005 | 0.895 ± 0.003 | 0.358 ± 0.011 | 0.788 ± 0.001 |
| Continuous | 0.562 ± 0.001 | 0.943 ± 0.001 | 0.518 ± 0.003 | 0.789 ± 0.001 |
| Binary | – | 0.932 ± 0.001 | 0.565 ± 0.004 | – |
| SimBoost | 0.249 ± 0.002 | 0.942 ± 0.002 | 0.523 ± 0.004 | 0.813 ± 0.020 |
Results on the KIBA dataset, with the mean and standard deviation from 10 repetitions
| Method | RMSE | AUC | AUPR | CI |
|---|---|---|---|---|
| MF | 0.382 ± 0.003 | 0.831 ± 0.002 | 0.631 ± 0.004 | 0.792 ± 0.001 |
| Continuous | 0.620 ± 0.001 | 0.884 ± 0.001 | 0.735 ± 0.001 | 0.792 ± 0.001 |
| Binary | – | 0.904 ± 0.001 | 0.766 ± 0.001 | – |
| SimBoost | 0.299 ± 0.001 | 0.875 ± 0.001 | 0.708 ± 0.002 | 0.796 ± 0.001 |
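The RMSE and CI columns in the result tables apply only to methods that output continuous values. A minimal sketch of both metrics follows, assuming CI denotes the concordance index in its common pairwise form (the fraction of pairs with different true values whose predictions are ordered the same way, with prediction ties counted as one half). The function names and toy data are hypothetical and do not reproduce the paper's evaluation code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted affinities."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predictions are correctly ordered.

    A pair (i, j) is comparable when y_true[i] > y_true[j]. Tied
    predictions contribute 0.5, wrongly ordered ones contribute 0.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(y_true)):
        for j in range(len(y_true)):
            if y_true[i] > y_true[j]:
                comparable += 1
                if y_pred[i] > y_pred[j]:
                    concordant += 1.0
                elif y_pred[i] == y_pred[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no pairs with different true values")
    return concordant / comparable


# Toy data: one of the six comparable pairs is ranked in the wrong order.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.0, 3.0, 4.5]

error = rmse(y_true, y_pred)
ci = concordance_index(y_true, y_pred)
print(round(error, 4), round(ci, 4))
```

RMSE penalizes the size of each error, while the concordance index only looks at ranking, so a model can improve on one without improving on the other.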
Fig. 3 Prediction from SimBoost against the real values on Davis
Fig. 4 Prediction from SimBoost against the real values on Metz
Fig. 5 Prediction from SimBoost against the real values on KIBA
Fig. 6 The prediction intervals of two targets from KIBA
Fig. 7 The relationship between the number of observations and the average width of the prediction intervals, in the KIBA dataset
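The average interval width discussed here can be computed from per-pair lower and upper bounds. The sketch below is a hedged illustration with made-up bounds and values; `interval_stats` is a hypothetical helper, and it says nothing about how SimBoostQuant derives its intervals.

```python
def interval_stats(lower, upper, y_true):
    """Return (average width, empirical coverage) of prediction intervals.

    Coverage is the fraction of true values that fall inside their
    interval, bounds included.
    """
    if not (len(lower) == len(upper) == len(y_true)) or not lower:
        raise ValueError("inputs must be non-empty and of equal length")
    widths = [u - l for l, u in zip(lower, upper)]
    covered = [l <= y <= u for l, u, y in zip(lower, upper, y_true)]
    return sum(widths) / len(widths), sum(covered) / len(covered)


# Made-up interval bounds and measured values for three drug-target pairs.
lower = [10.0, 11.0, 10.5]
upper = [12.0, 12.0, 13.5]
measured = [11.2, 12.4, 11.0]

avg_width, coverage = interval_stats(lower, upper, measured)
print(avg_width)  # 2.0
```

A narrow average width is only informative when coverage stays close to the nominal level, so the two numbers are best read together.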
Fig. 8 Relative feature importance in Davis