Noureddin Sadawi, Ivan Olier, Joaquin Vanschoren, Jan N van Rijn, Jeremy Besnard, Richard Bickerton, Crina Grosan, Larisa Soldatova, Ross D King.
Abstract
The goal of quantitative structure activity relationship (QSAR) learning is to learn a function that, given the structure of a small molecule (a potential drug), outputs the predicted activity of the compound. We employed multi-task learning (MTL) to exploit commonalities in drug targets and assays. We used datasets containing curated records about the activity of specific compounds on drug targets provided by ChEMBL. In total, 1091 assays were analysed. As a baseline, we considered a single-task learning (STL) approach that trains a random forest to predict drug activity for each drug target individually. We then carried out feature-based and instance-based MTL to predict drug activities. We introduced a natural metric of evolutionary distance between drug targets as a measure of task relatedness. Instance-based MTL significantly outperformed both feature-based MTL and the base learner on 741 of the 1091 drug targets. Feature-based MTL won on 179 occasions and the base learner performed best on 171 drug targets. We conclude that MTL QSAR is improved by incorporating the evolutionary distance between targets. These results indicate that QSAR learning can be performed effectively, even if little data is available for specific drug targets, by leveraging what is known about similar drug targets.
Keywords: Multi-task learning; Quantitative structure activity relationship; Random forest; Sequence-based similarity
Year: 2019 PMID: 33430958 PMCID: PMC6852942 DOI: 10.1186/s13321-019-0392-1
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
A typical QSAR dataset
| MOL_ID | FP_1 | FP_2 | ... | FP_n | Activity |
|---|---|---|---|---|---|
| ID_1 | 1 | 0 | ... | 1 | 6.351 |
| ID_2 | 0 | 1 | ... | 0 | 7.534 |
| ... | ... | ... | ... | ... | ... |
| ID_22 | 1 | 1 | ... | 1 | 8.001 |
| ID_23 | 0 | 1 | ... | 0 | 6.239 |
An input dataset for feature-based MTL
| MOL_ID | TID | FP_1 | FP_2 | ... | Activity |
|---|---|---|---|---|---|
| ID_1 | 7 | 1 | 0 | ... | 6.351 |
| ID_2 | 7 | 0 | 1 | ... | 7.534 |
| ... | ... | ... | ... | ... | ... |
| ID_111 | 95 | 1 | 1 | ... | 8.001 |
| ID_112 | 95 | 0 | 1 | ... | 6.239 |
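One way a table of this shape could be assembled is to stack the single-target QSAR datasets and tag every row with its target identifier (TID). The following is a minimal sketch under that assumption, not the authors' pipeline; the function name `stack_tasks` and the dict-based row layout are hypothetical, and the toy rows reuse the illustrative values from the table above with only two fingerprint bits.

```python
def stack_tasks(datasets_by_tid):
    """Concatenate per-target rows into one table, adding TID as a column."""
    stacked = []
    for tid, rows in sorted(datasets_by_tid.items()):
        for row in rows:
            # Keep MOL_ID first, then TID, then fingerprints and activity.
            rest = {k: v for k, v in row.items() if k != "MOL_ID"}
            stacked.append({"MOL_ID": row["MOL_ID"], "TID": tid, **rest})
    return stacked


# Toy single-target datasets keyed by target ID (hypothetical layout).
datasets_by_tid = {
    7: [
        {"MOL_ID": "ID_1", "FP_1": 1, "FP_2": 0, "Activity": 6.351},
        {"MOL_ID": "ID_2", "FP_1": 0, "FP_2": 1, "Activity": 7.534},
    ],
    95: [
        {"MOL_ID": "ID_111", "FP_1": 1, "FP_2": 1, "Activity": 8.001},
        {"MOL_ID": "ID_112", "FP_1": 0, "FP_2": 1, "Activity": 6.239},
    ],
}

mtl_rows = stack_tasks(datasets_by_tid)
for row in mtl_rows:
    print(row)
```

With the rows stacked this way, a single learner sees the TID alongside the fingerprint bits, which is what lets it share information across targets.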
An output table for feature-based MTL
| FOLD | MOL_ID | TID | Activity | Prediction |
|---|---|---|---|---|
| 1 | ID_1 | 7 | 6.351 | 6.011 |
| 1 | ID_2 | 7 | 7.534 | 7.681 |
| ... | ... | ... | ... | ... |
| 10 | ID_111 | 95 | 8.001 | 7.764 |
| 10 | ID_112 | 95 | 6.239 | 6.401 |
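An output table like this can be reduced to one RMSE value per drug target, which is the quantity compared across methods in the results. The sketch below assumes that predictions from all folds are pooled per target before the RMSE is taken; the source does not state whether pooling or per-fold averaging was used, and `rmse_per_target` is a hypothetical helper name.

```python
import math
from collections import defaultdict


def rmse_per_target(rows):
    """Return {TID: RMSE}, pooling predictions from all folds per target."""
    squared_errors = defaultdict(list)
    for r in rows:
        squared_errors[r["TID"]].append((r["Activity"] - r["Prediction"]) ** 2)
    return {tid: math.sqrt(sum(v) / len(v)) for tid, v in squared_errors.items()}


# Toy rows reusing the illustrative values from the table above.
rows = [
    {"FOLD": 1, "MOL_ID": "ID_1", "TID": 7, "Activity": 6.351, "Prediction": 6.011},
    {"FOLD": 1, "MOL_ID": "ID_2", "TID": 7, "Activity": 7.534, "Prediction": 7.681},
    {"FOLD": 10, "MOL_ID": "ID_111", "TID": 95, "Activity": 8.001, "Prediction": 7.764},
    {"FOLD": 10, "MOL_ID": "ID_112", "TID": 95, "Activity": 6.239, "Prediction": 6.401},
]

rmse = rmse_per_target(rows)
for tid, value in sorted(rmse.items()):
    print(tid, round(value, 3))
```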
A dataset for instance-based MTL
| MOL_ID | TID | SimToTID_7 | ... | SimToTID_95 | FP_1 | FP_2 | ... | Activity |
|---|---|---|---|---|---|---|---|---|
| ID_1 | 7 | 1 | ... | 0.584 | 1 | 0 | ... | 6.351 |
| ID_2 | 7 | 1 | ... | 0.584 | 0 | 1 | ... | 7.534 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ID_111 | 95 | 0.584 | ... | 1 | 1 | 1 | ... | 8.001 |
| ID_112 | 95 | 0.584 | ... | 1 | 1 | 1 | ... | 6.239 |
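The table suggests that each row carries the similarity between its own target and every target in the group, in addition to the fingerprint bits. A minimal sketch of adding such columns, assuming a precomputed symmetric similarity lookup, could look as follows. The name `add_similarity_columns` and the nested-dict `similarity` structure are hypothetical, and the value 0.584 is simply the illustrative figure from the table.

```python
def add_similarity_columns(rows, similarity, tids):
    """Insert one SimToTID_<t> column per target t after the TID column."""
    out = []
    for row in rows:
        sims = {f"SimToTID_{t}": similarity[row["TID"]][t] for t in tids}
        rest = {k: v for k, v in row.items() if k not in ("MOL_ID", "TID")}
        out.append({"MOL_ID": row["MOL_ID"], "TID": row["TID"], **sims, **rest})
    return out


# Hypothetical pairwise target similarities (1.0 on the diagonal).
similarity = {
    7: {7: 1.0, 95: 0.584},
    95: {7: 0.584, 95: 1.0},
}

stacked_rows = [
    {"MOL_ID": "ID_1", "TID": 7, "FP_1": 1, "FP_2": 0, "Activity": 6.351},
    {"MOL_ID": "ID_111", "TID": 95, "FP_1": 1, "FP_2": 1, "Activity": 8.001},
]

instance_rows = add_similarity_columns(stacked_rows, similarity, tids=[7, 95])
for row in instance_rows:
    print(row)
```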
ChEMBL’s 6-level protein family classification
| Level | No. of classes |
|---|---|
| L1 | 13 |
| L2 | 24 |
| L3 | 46 |
| L4 | 111 |
| L5 | 180 |
| L6 | 50 |
Pair-wise sign test for the L5 results
| Setting | # positive | # negative | # ties |
|---|---|---|---|
| Feature-based MTL vs STL | 686 | 405 | 0 |
| Instance-based MTL vs STL | 911 | 180 | 0 |
| Instance-based MTL vs feature-based MTL | 891 | 200 | 0 |
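Counts of this kind come from comparing two methods target by target and recording which one did better. The sketch below shows a plain two-sided sign test on paired RMSE values, assuming that a positive outcome means the first method has the lower RMSE and that ties are excluded from the test; neither convention is confirmed by the source. `sign_test` is a hypothetical helper and the RMSE values are made up for illustration.

```python
from math import comb


def sign_test(rmse_a, rmse_b):
    """Two-sided sign test on paired RMSE values (lower is better).

    Returns (positives, negatives, ties, p_value), where a positive
    outcome means method A has the lower RMSE on that target.
    """
    pos = sum(a < b for a, b in zip(rmse_a, rmse_b))
    neg = sum(a > b for a, b in zip(rmse_a, rmse_b))
    ties = len(rmse_a) - pos - neg
    n = pos + neg
    # Under the null hypothesis each non-tied pair is a fair coin flip.
    tail = sum(comb(n, k) for k in range(min(pos, neg) + 1))
    p_value = min(1.0, 2 * tail / 2 ** n)
    return pos, neg, ties, p_value


# Toy paired RMSE values for ten targets (illustrative only).
method_a = [0.61, 0.70, 0.55, 0.80, 0.66, 0.59, 0.73, 0.68, 0.90, 0.62]
method_b = [0.74, 0.75, 0.60, 0.85, 0.70, 0.65, 0.78, 0.72, 0.85, 0.60]

result = sign_test(method_a, method_b)
print(result)
```

The exact binomial tail is computed with integer arithmetic, so the function also works for over a thousand pairs without overflow.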
Fig. 1 The number of drug targets on which each method scores the lowest RMSE value
Pair-wise Wilcoxon signed-ranks test for L5 results (W is the test statistic)
| Setting | Medians | W | p-value |
|---|---|---|---|
| STL vs feature-based MTL | 0.744 and 0.701 | 374646 | 1.609e−13 |
| STL vs instance-based MTL | 0.744 and 0.633 | 535197 | 2.2e−16 |
| Feature-based MTL vs instance-based MTL | 0.701 and 0.633 | 535673 | 2.2e−16 |
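For readers unfamiliar with the test, the signed-rank statistic can be sketched in a few lines. The version below assumes W is the sum of the ranks of the positive paired differences, with zero differences dropped and tied absolute differences given their average rank. This is one common convention and may not match the exact definition used for the table; `signed_rank_statistic` is a hypothetical helper and the RMSE values are made up.

```python
def signed_rank_statistic(x, y):
    """Sum of ranks of positive differences x - y (zeros dropped)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    # Indices of the differences, ordered by absolute size.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Extend j over a run of tied absolute differences.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return sum(r for r, d in zip(ranks, diffs) if d > 0)


# Toy paired RMSE values for four targets (illustrative only).
stl_rmse = [0.80, 0.70, 0.90, 0.60]
mtl_rmse = [0.70, 0.75, 0.60, 0.60]

w = signed_rank_statistic(stl_rmse, mtl_rmse)
print(w)
```

A p-value would additionally require the null distribution of W (exact for small samples, normal approximation for large ones), which is omitted here.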
Fig. 2 Feature-based and instance-based MTL compared with STL (ranked from 3 to 1) using L5 classes
Fig. 3 The number of drug targets on which each method scored the highest R-squared value
Fig. 4 Boxplot of RMSE values for the three settings when applied to all L5 drug target classes
Fig. 5 Barplot of RMSE values for 21 DHFR drug targets
Pair-wise Wilcoxon signed-ranks test for the 21 DHFR group results (W is the test statistic)
| Setting | Medians | W | p-value |
|---|---|---|---|
| STL vs feature-based MTL | 0.821 and 0.808 | 108 | 0.8117 |
| STL vs instance-based MTL | 0.821 and 0.668 | 222 | 3.147e−05 |
| Feature-based MTL vs instance-based MTL | 0.808 and 0.668 | 220 | 5.245e−05 |
Fig. 6 Boxplot of RMSE values for the three settings when applied to 21 DHFR drug targets