Xiaolin Sun, Ryo Tamura, Masato Sumita, Kenichi Mori, Kei Terayama, Koji Tsuda.
Abstract
A large amount of bioactivity assay data has already accumulated in public databases, but integrating these data sets for quantitative structure-activity relationship (QSAR) studies is not straightforward because of differences in experimental methods and settings. We present an efficient deep-learning-based approach called Deep Preference Data Integration (DPDI). To integrate outcome variables of different assay types, a surrogate variable is introduced, and a neural network is trained such that the total order induced by the surrogate variable is maximally consistent with the given data sets. In a task of predicting the efficacy of factor Xa inhibitors, DPDI successfully integrated 2959 molecules distributed across 129 assay data sets. In most of our experiments, data integration strongly improved prediction accuracy in interpolation and extrapolation tasks, indicating that DPDI is an effective tool for QSAR studies.
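The training idea in the abstract, fitting a surrogate variable whose ordering agrees with the orderings inside each assay, can be illustrated with a small sketch. This is not the authors' implementation: the linear surrogate, the pairwise logistic loss, the helper names (`pairs_within`, `pairwise_logistic_loss`, `fit_linear_surrogate`), and the synthetic data are all assumptions chosen to keep the example short, and it assumes a larger outcome value means a more active molecule.

```python
import numpy as np

def pairs_within(indices, values):
    """(winner, loser) index pairs inside one assay; larger value wins (assumption)."""
    return [(i, j) for i, vi in zip(indices, values)
            for j, vj in zip(indices, values) if vi > vj]

def pairwise_logistic_loss(scores, pairs):
    """Mean of log(1 + exp(-(s_winner - s_loser))) over all preference pairs."""
    w, l = np.asarray(pairs).T
    return float(np.mean(np.logaddexp(0.0, -(scores[w] - scores[l]))))

def fit_linear_surrogate(X, pairs, lr=0.1, steps=500):
    """Gradient descent on the pairwise loss for a linear surrogate s = X @ theta."""
    w, l = np.asarray(pairs).T
    diff = X[w] - X[l]                       # feature difference for each pair
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        margin = diff @ theta
        # Gradient of log(1 + exp(-m)) with respect to theta is -sigmoid(-m) * diff.
        grad = -(diff * (1.0 / (1.0 + np.exp(margin)))[:, None]).mean(axis=0)
        theta -= lr * grad
    return theta

# Synthetic stand-in data: 10 molecules with 3 hypothetical descriptors each.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
latent = X @ np.array([1.0, -2.0, 0.5])

# Two "assays" report the same latent activity on different scales, so their
# raw values are not comparable, but the within-assay orderings are.
pairs = pairs_within(range(0, 5), latent[:5]) + pairs_within(range(5, 10), np.exp(latent[5:]))

loss_before = pairwise_logistic_loss(X @ np.zeros(3), pairs)
theta = fit_linear_surrogate(X, pairs)
loss_after = pairwise_logistic_loss(X @ theta, pairs)
print(loss_before, loss_after)
```

Because only within-assay comparisons enter the loss, the two assays never need to share a unit or scale, which is the point of introducing a surrogate variable.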
Year: 2021 PMID: 35047110 PMCID: PMC8762726 DOI: 10.1021/acsmedchemlett.1c00439
Source DB: PubMed Journal: ACS Med Chem Lett ISSN: 1948-5875 Impact factor: 4.345
Figure 1. Data integration with a surrogate variable. (a) For ligands A–G, the outcome values for two different assay types are shown. (b) The first and second rows show the rankings according to the outcome values of the corresponding assay types. The third row shows the ranking according to the surrogate values predicted with a neural network.
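The conversion from per-assay outcome values to rankings described in this caption can be sketched as follows. The ligand values, the surrogate scores, and the function names `to_preferences` and `consistency` are hypothetical, and the sketch assumes that a larger outcome value means a more active ligand.

```python
from itertools import combinations

def to_preferences(outcomes):
    """Turn one assay's {ligand: value} map into (winner, loser) pairs.

    Assumes a larger value means a more active ligand; ties yield no pair.
    """
    prefs = []
    for a, b in combinations(sorted(outcomes), 2):
        if outcomes[a] > outcomes[b]:
            prefs.append((a, b))
        elif outcomes[b] > outcomes[a]:
            prefs.append((b, a))
    return prefs

def consistency(prefs, surrogate):
    """Fraction of preference pairs that the surrogate values order the same way."""
    agree = sum(surrogate[w] > surrogate[l] for w, l in prefs)
    return agree / len(prefs)

# Hypothetical outcome values for two assay types on different scales.
assay_1 = {"A": 7.1, "B": 6.4, "C": 5.0, "D": 8.2}
assay_2 = {"D": 120.0, "E": 95.0, "F": 300.0, "G": 40.0}

# Preferences are formed within each assay only, never across assays.
prefs = to_preferences(assay_1) + to_preferences(assay_2)

# Hypothetical surrogate values, standing in for a trained network's output.
surrogate = {"A": 0.6, "B": 0.4, "C": 0.1, "D": 0.8, "E": 0.7, "F": 0.9, "G": 0.2}
print(len(prefs), consistency(prefs, surrogate))
```

A single surrogate ranking over all seven ligands can agree with both assay rankings even though the raw values of the two assays cannot be compared directly.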
Figure 2. Experimental details. When learning with the integrated data set (shown as integrate), the main data set and the external data set are independently converted to preferences. After DPDI is trained with the preferences, the candidate molecules are converted to surrogate values. The surrogate values are then converted to preferences, and the resulting ranking is compared with the true ranking. Normalized discounted cumulative gain (NDCG) is used as the accuracy measure. In direct mix, the main data set and the external data set are used as they are. A fully connected network is trained by minimizing the mean squared error (MSE) on both data sets, and the activity values of the candidate molecules are predicted. After the predicted values are converted to preferences, NDCG is used to measure the accuracy.
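For readers unfamiliar with the accuracy measure, a minimal NDCG sketch is given below. It assumes linear gain, a log2 position discount, and non-negative relevance values; the source does not state which NDCG variant or cutoff was used, so the function names and these choices are illustrative only.

```python
import numpy as np

def dcg(relevance):
    """Discounted cumulative gain of relevance values listed in ranked order."""
    relevance = np.asarray(relevance, dtype=float)
    discounts = np.log2(np.arange(2, relevance.size + 2))
    return float(np.sum(relevance / discounts))

def ndcg(true_relevance, predicted_scores):
    """NDCG of the ranking induced by predicted_scores (higher score ranks earlier)."""
    true_relevance = np.asarray(true_relevance, dtype=float)
    order = np.argsort(-np.asarray(predicted_scores, dtype=float), kind="stable")
    ideal = dcg(np.sort(true_relevance)[::-1])   # best achievable ordering
    return dcg(true_relevance[order]) / ideal if ideal > 0 else 0.0

# Hypothetical true activities and two sets of predicted surrogate values.
true_activity = [3.0, 2.0, 1.0]
print(ndcg(true_activity, [0.9, 0.5, 0.1]))  # ranking matches the true order
print(ndcg(true_activity, [0.1, 0.5, 0.9]))  # ranking is fully reversed
```

Because NDCG depends only on the order of the predicted values, it can score surrogate values and directly predicted activities on the same footing, which is why both pipelines in the figure end with this measure.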
List of ChEMBL Assay Data Sets Used as the Main Data Set^a. Accuracy columns report NDCG (mean ± STD).

| main data set | size | source (document year) | interpolation: single | interpolation: integrated | interpolation: direct mix | extrapolation: single | extrapolation: integrated |
|---|---|---|---|---|---|---|---|
| CHEMBL3885775 | 56 | K4DD project | 0.66 ± 0.21 | 0.85 ± 0.17 | 0.63 ± 0.23 | 0.41 ± 0.14 | 0.36 ± 0.12 |
| CHEMBL968695 | 55 | scientific literature (2009) | 0.62 ± 0.15 | 0.65 ± 0.16 | 0.61 ± 0.15 | 0.35 ± 0.15 | 0.43 ± 0.17 |
| CHEMBL3885768 | 55 | K4DD project | 0.54 ± 0.14 | 0.82 ± 0.18 | 0.59 ± 0.22 | 0.37 ± 0.09 | 0.41 ± 0.08 |
| CHEMBL659609 | 62 | scientific literature (2004) | 0.81 ± 0.17 | 0.78 ± 0.18 | 0.57 ± 0.16 | 0.24 ± 0.06 | 0.46 ± 0.20 |
| CHEMBL885070 | 46 | scientific literature (2002) | 0.81 ± 0.19 | 0.84 ± 0.17 | 0.54 ± 0.19 | 0.33 ± 0.23 | 0.42 ± 0.20 |
| CHEMBL3885772 | 55 | K4DD project | 0.53 ± 0.15 | 0.80 ± 0.19 | 0.50 ± 0.15 | 0.30 ± 0.09 | 0.46 ± 0.08 |
^a Test accuracies in different experimental settings are summarized. For information about the K4DD project, see Schuetz et al.[18] The sources of CHEMBL968695, CHEMBL659609, and CHEMBL885070 are Zhang et al.,[6] Jia et al.,[19] and Zhang et al.,[20] respectively.
Figure 3. Results of interpolation experiments.
Figure 4. Results of extrapolation experiments.
Figure 5. (a) Accuracy of Gaussian process, DPDI, and rankSVM in interpolation experiments for ChEMBL3885765. (b) Computational time of Gaussian process, DPDI, and rankSVM.