
Integrating Incompatible Assay Data Sets with Deep Preference Learning.

Xiaolin Sun1, Ryo Tamura1,2,3,4, Masato Sumita3,4, Kenichi Mori5, Kei Terayama4,6, Koji Tsuda1,2,4.   

Abstract

A large amount of bioactivity assay data has accumulated in public databases, but integrating these data sets for quantitative structure-activity relationship (QSAR) studies is not straightforward due to differences in experimental methods and settings. We present an efficient deep-learning-based approach called Deep Preference Data Integration (DPDI). To integrate outcome variables of different assay types, a surrogate variable is introduced, and a neural network is trained such that the total order induced by the surrogate variable is maximally consistent with the given data sets. In a task of predicting the efficacy of factor Xa inhibitors, DPDI successfully integrated 5929 molecules distributed across 129 assay data sets. In most of our experiments, data integration strongly improved prediction accuracy in interpolation and extrapolation tasks, indicating that DPDI is an effective tool for QSAR studies.
© 2021 The Authors. Published by American Chemical Society.


Year:  2021        PMID: 35047110      PMCID: PMC8762726          DOI: 10.1021/acsmedchemlett.1c00439

Source DB:  PubMed          Journal:  ACS Med Chem Lett        ISSN: 1948-5875            Impact factor:   4.345


A large number of bioassay data sets are accumulated in public databases, but their use is limited due to differences in experimental methods and settings. We propose a new deep learning model called Deep Preference Data Integration (DPDI) to enable the integration of incompatible data sets. Our method increases the value of public data sets by providing the means to reuse them.

In quantitative structure–activity relationship (QSAR) studies, researchers are interested in the structural features of molecules that determine their bioactivities. Machine learning models are an essential part of QSAR studies, where the bioactivities of a large number of molecules are induced from training examples. To maximize the size of a training set, one may consider combining multiple bioactivity assay data sets deposited in public databases such as ChEMBL and PubChem BioAssay.[1] The use of multiple data sets is, however, limited to only a few cases.[2−4] One of the main reasons lies in the incompatibility of these data sets: even if biological activities are reported in the same unit, such as IC50, combining the data sets may not improve prediction accuracy because of differences in experimental methods and conditions.

For example, consider the two bioactivity assay data sets ChEMBL968695 and ChEMBL3885775. In both assays, the target protein is factor Xa, a protease involved in the blood coagulation pathway.[5] It acts by cleaving prothrombin in two places, which yields active thrombin. The first data set is obtained by the human plasma-based thrombin generation test, where activity is measured by the amount of thrombin, the product of factor Xa, in human plasma.[6] The second data set is obtained by a biochemical assay using a fluorogenic peptide substrate.[7] A fluorogenic peptide substrate consists of a peptide that factor Xa can cleave and a fluorophore.
The substrate is normally not fluorescent, but fluorescence is restored when factor Xa cleaves off the fluorophore. Using this method, one can measure the activity of factor Xa via the fluorescence intensity. Measurements from completely different assay types, as exemplified above, cannot be compared directly, and mixing such data without any treatment may be harmful for machine learning.

Assume that n ligands are represented as d-dimensional fingerprints x1, ..., xn ∈ {0,1}^d. Denote by y_ij the outcome of ligand i for assay type j. Typically, some of the outcome values are not available (Figure 1a). One possible way to integrate such data sets is multitask learning,[8] where a machine learning model is trained to predict all outcomes from a fingerprint. However, the number of available ligands for an assay type can be extremely small (e.g., 2 or 3), so accurate prediction of all the outcome variables is not feasible. In this paper, we instead consider a virtual outcome variable ŷ and call it the surrogate variable. A neural network model is trained to predict the surrogate variable from a fingerprint such that the total order induced by the surrogate variable conforms to all available data. To this aim, each data set is represented as a set of pairwise preferences (i.e., larger-than relationships, ≻). For example, assay type 1 in Figure 1b is represented as C ≻ A, B ≻ C, and B ≻ A. The neural network is trained to minimize the number of preferences contradicting the total order induced by the surrogate variable. When a new ligand is given, the neural network predicts its surrogate value, and the user can place the new ligand in the ranking of any assay type to judge how promising it is.
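The conversion of an assay data set into pairwise preferences can be sketched as follows. This is a minimal illustration; the function name and the toy outcome values are ours, not from the paper:

```python
from itertools import combinations

def assay_to_preferences(outcomes):
    """Convert one assay's outcome values into pairwise preferences.

    `outcomes` maps ligand id -> measured outcome. Every pair of ligands
    with distinct outcomes yields one (winner, loser) preference, so
    rankings from incompatible assays become comparable training signals.
    """
    prefs = []
    for u, v in combinations(sorted(outcomes), 2):
        if outcomes[u] > outcomes[v]:
            prefs.append((u, v))  # u ≻ v
        elif outcomes[v] > outcomes[u]:
            prefs.append((v, u))  # v ≻ u
        # ties produce no preference
    return prefs

# Toy outcomes for ligands A, B, C reproduce the preferences
# C ≻ A, B ≻ C, and B ≻ A used as an example in the text.
print(assay_to_preferences({"A": 1.0, "B": 9.0, "C": 4.0}))
# → [('B', 'A'), ('C', 'A'), ('B', 'C')]
```

Because each assay contributes only relative orderings, no unit conversion or normalization between assays is needed before integration.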
Figure 1

Data integration with a surrogate variable. (a) For ligands A–G, the outcome values for two different assay types are shown. (b) The first and second rows show the rankings according to the outcome values of corresponding assay types. The third row shows the ranking due to the surrogate values predicted with a neural network.

In the literature,[9,10] it is reported that the accuracy of machine learning models depends on the domain of applicability, i.e., the outcome range that encloses the training examples. Machine learning is powerful in interpolation (i.e., prediction for test examples within the domain) but poor in extrapolation (i.e., prediction for those outside the domain). Notably, DPDI often expands the domain of applicability. For assay type 1 in Figure 1b, the domain of applicability spans from A to B. When both assay types are integrated into the surrogate values, the domain is expanded to span from A to G. Since a ligand better than the known ones is always wanted in virtual screening, extrapolation is more important than interpolation. In our computational experiments with 129 ChEMBL assay data sets, we observed a strong improvement of extrapolation accuracy as a result of data integration. On the other hand, simply mixing the assay data sets deteriorated the accuracy. This result demonstrates that DPDI can overcome differences in assay types and enables the effective use of public data for better virtual screening. In addition, DPDI was shown to be more scalable than an alternative Gaussian-process-based preference learning model,[11] indicating that DPDI can be applied to large-scale projects without difficulty.

In DPDI, a fully connected neural network[12] is employed to predict the surrogate value from a fingerprint. Throughout this paper, 300-dimensional Mol2vec fingerprints[13] are used owing to their high expressive ability; Shibayama et al.[21] reported better prediction performance for Mol2vec in comparison to existing fingerprints.
The hyperparameters of the network are adjusted using the black-box optimization software Optuna.[14] The hyperparameters and their ranges are as follows: the number of hidden layers (1–5), the number of units in each layer (4–1024), the learning rate (0.0001–0.1), the dropout rate (0–0.4), and the optimizer type (Adam or stochastic gradient descent).

Each assay data set is converted to pairwise preferences, and all preferences are summarized into one training set D = {u_m ≻ v_m} (m = 1, ..., M), where u_m and v_m are indices of ligands and M is the total number of preferences. Since there are multiple assays, the same pair of ligands may appear multiple times. Let ŷ_i denote the surrogate value of ligand i. We would like to train the network to minimize a loss function that represents the number of training examples contradicting the order induced by the surrogate variable, Σ_{m=1}^{M} I(ŷ_{u_m} ≤ ŷ_{v_m}), where I(·) is the indicator function that returns 1 if the condition inside the parentheses is satisfied and 0 otherwise. To make the neural network trainable, however, the loss function has to be differentiable. To this aim, the number of contradicting examples is approximated by the following cross-entropy function,

L = −Σ_{m=1}^{M} log P(u_m ≻ v_m),

where P(u ≻ v) is defined via the surrogate values as

P(u ≻ v) = 1 / (1 + exp(ŷ_v − ŷ_u)).

We collected 129 bioactivity assay data sets about factor Xa from the ChEMBL database (Supporting Information). We chose factor Xa because of its clinical importance and the availability of quite a few data sets in public databases. Factor Xa is a target for the development of new anticoagulants for the treatment of pathologic arterial and venous thrombosis.[22] Each data set contains from 2 to 85 ligands, and the total number of ligands is 5929. One data set is selected as the main data set, which is then divided into training, validation, and test sets in the ratio 3:1:1. The validation set is kept aside to monitor the loss during neural network training and hyperparameter tuning. In this section, the data set is divided randomly to test DPDI's interpolation performance.
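The preference cross-entropy described above can be sketched in plain Python. This shows only the loss; in the paper the surrogate values ŷ come from a fully connected PyTorch network, whereas here they are given directly for illustration:

```python
import math

def pairwise_cross_entropy(surrogate, prefs):
    """Cross-entropy surrogate for the number of contradicted preferences.

    P(u > v) = 1 / (1 + exp(-(y_u - y_v))); the loss -sum(log P(u > v))
    over the preference set D is small exactly when the surrogate
    ordering agrees with every observed preference.
    """
    loss = 0.0
    for u, v in prefs:  # each pair encodes u ≻ v
        p = 1.0 / (1.0 + math.exp(-(surrogate[u] - surrogate[v])))
        loss -= math.log(p)
    return loss

prefs = [("B", "A"), ("C", "A"), ("B", "C")]
consistent = {"A": 0.1, "B": 2.0, "C": 1.0}     # respects B ≻ C ≻ A
contradicting = {"A": 2.0, "B": 0.1, "C": 1.0}  # reverses the order
assert pairwise_cross_entropy(consistent, prefs) < pairwise_cross_entropy(contradicting, prefs)
```

Because the loss is a smooth function of the surrogate values, gradients flow back into the network that produces them, which is exactly why the indicator-function count is replaced by this approximation.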
We compared the following three scenarios: In one scenario called learning with single data set, only the training set taken from the main data set is used. In the second scenario called learning with integrated data set, the training set from the main data set is integrated with all the other 128 assay data sets via DPDI. In the third scenario called direct mix, the training set from the main data set is simply mixed with all the other data sets without any treatment. A fully connected neural network is trained with the squared loss function. Hyperparameter tuning is performed in the same way as DPDI. See Figure for experimental details.
Figure 2

Experimental details. In learning with integrated data set (shown as integrate), the main data set and the external data sets are independently converted to preferences. After DPDI is trained with the preferences, the candidate molecules are converted to surrogate values. The surrogate values are then converted to preferences and compared with the true ranking; normalized discounted cumulative gain (NDCG) is used as the accuracy measure. In direct mix, the main data set and the external data sets are used as they are. A fully connected network is trained by minimizing the mean squared error (MSE) loss on both data sets, and the activity values of the candidate molecules are predicted. After they are converted to preferences, NDCG is used to measure the accuracy.

In all scenarios, the test accuracy is computed by comparing the ranking due to predicted surrogate values against the ground-truth ranking. As the accuracy measure, we employed normalized discounted cumulative gain (NDCG).[15,16] Assume that the entity ranked at the ith position in the ground-truth ranking is ranked at the R(i)th position in the predicted ranking. Discounted cumulative gain (DCG) is defined as

DCG = Σ_{i=1}^{c} (c − i + 1) / log2(R(i) + 1),

where c is the number of all entities. NDCG is the ratio of DCG to its maximum possible value, attained when R(i) = i for all i. It is one if the two rankings match completely, and a lower value indicates a poorer match.

Computational experiments are performed with each of the six assay data sets listed in Table 1 designated as the main data set. These are the largest ones among all the data sets. Figure 3 shows the distribution of test accuracy over 50 different data divisions; summary statistics are shown in Table 1 as well. Notably, the test accuracy of the direct mix scenario was worse than that of the single data scenario in most cases. This result illustrates the difficulty of data integration due to differences in assay types.
Comparing the single and integrated data scenarios, the accuracy improved in five out of six cases, indicating that DPDI makes effective use of the additional information contained in the other data sets.
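The NDCG accuracy measure can be sketched as follows. The linear gain (c − i + 1 for the entity ranked ith in the ground truth) is one common choice among NDCG variants, assumed here for illustration:

```python
import math

def ndcg(true_order, predicted_scores):
    """NDCG of a predicted scoring against a ground-truth ranking.

    Each entity's gain (assumed linear in its ground-truth rank) is
    discounted by log2(position + 1) at its position in the predicted
    ranking; NDCG divides the resulting DCG by the ideal DCG.
    """
    c = len(true_order)
    gain = {e: c - i for i, e in enumerate(true_order)}  # c, c-1, ..., 1
    pred_order = sorted(true_order, key=lambda e: -predicted_scores[e])
    dcg = sum(gain[e] / math.log2(pos + 2) for pos, e in enumerate(pred_order))
    ideal = sum(gain[e] / math.log2(pos + 2) for pos, e in enumerate(true_order))
    return dcg / ideal

# A perfect prediction gives NDCG = 1; a scrambled one gives less.
assert abs(ndcg(["B", "C", "A"], {"B": 3.0, "C": 2.0, "A": 1.0}) - 1.0) < 1e-12
assert ndcg(["B", "C", "A"], {"B": 1.0, "C": 2.0, "A": 3.0}) < 1.0
```

The logarithmic discount makes mistakes near the top of the ranking cost more than mistakes near the bottom, which matches the virtual-screening setting where the best-ranked ligands matter most.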
Table 1

List of ChEMBL Assay Data Sets Used as the Main Data Set^a

All entries are NDCG (mean ± STD).

| main data set | size | source (document year) | interpolation: single | interpolation: integrated | interpolation: direct mix | extrapolation: single | extrapolation: integrated |
|---|---|---|---|---|---|---|---|
| CHEMBL3885775 | 56 | K4DD project | 0.66 ± 0.21 | 0.85 ± 0.17 | 0.63 ± 0.23 | 0.41 ± 0.14 | 0.36 ± 0.12 |
| CHEMBL968695 | 55 | scientific literature (2009) | 0.62 ± 0.15 | 0.65 ± 0.16 | 0.61 ± 0.15 | 0.35 ± 0.15 | 0.43 ± 0.17 |
| CHEMBL3885768 | 55 | K4DD project | 0.54 ± 0.14 | 0.82 ± 0.18 | 0.59 ± 0.22 | 0.37 ± 0.09 | 0.41 ± 0.08 |
| CHEMBL659609 | 62 | scientific literature (2004) | 0.81 ± 0.17 | 0.78 ± 0.18 | 0.57 ± 0.16 | 0.24 ± 0.06 | 0.46 ± 0.20 |
| CHEMBL885070 | 46 | scientific literature (2002) | 0.81 ± 0.19 | 0.84 ± 0.17 | 0.54 ± 0.19 | 0.33 ± 0.23 | 0.42 ± 0.20 |
| CHEMBL3885772 | 55 | K4DD project | 0.53 ± 0.15 | 0.80 ± 0.19 | 0.50 ± 0.15 | 0.30 ± 0.09 | 0.46 ± 0.08 |

^a Test accuracies in different experimental settings are summarized. For information about the K4DD project, see Schuetz et al.[18] The sources of CHEMBL968695, CHEMBL659609, and CHEMBL885070 are Zhang et al.,[6] Jia et al.,[19] and Zhang et al.,[20] respectively.

Figure 3

Results of interpolation experiments.

Next, we tested the extrapolation performance of DPDI. To simulate extrapolation, the main data set is divided as follows. First, the ligands with the top 20% of outcome values are designated as the test set. The rest is randomly divided into training and validation sets in the ratio 3:1. Figure 4 and Table 1 show the distribution and summary statistics of the test accuracy, respectively. First of all, the test accuracy is significantly lower than in interpolation, indicating that extrapolation is a much harder task. In five out of six cases, the integrated data scenario with DPDI outperformed the single data scenario. For CHEMBL659609, the improvement is dramatic: the average test accuracy almost doubles. This result implies that DPDI helps extrapolation by expanding the domain of applicability.
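The extrapolation split described above can be sketched as follows. The helper name and the handling of ties are illustrative choices of ours; the paper does not specify them:

```python
import random

def extrapolation_split(outcomes, test_fraction=0.2, seed=0):
    """Split a main data set for the extrapolation experiment.

    The top `test_fraction` of ligands by outcome form the test set, so
    the model must predict beyond the outcome range seen in training;
    the remaining ligands are shuffled and divided 3:1 into training
    and validation sets.
    """
    ranked = sorted(outcomes, key=lambda e: -outcomes[e])
    n_test = max(1, round(test_fraction * len(ranked)))
    test, rest = ranked[:n_test], ranked[n_test:]
    random.Random(seed).shuffle(rest)
    n_train = round(0.75 * len(rest))
    return rest[:n_train], rest[n_train:], test

# 10 ligands: the 2 highest outcomes form the test set,
# the remaining 8 split into 6 training and 2 validation ligands.
train, val, test = extrapolation_split({f"L{i}": float(i) for i in range(10)})
assert set(test) == {"L9", "L8"} and len(train) == 6 and len(val) == 2
```

Holding out the top of the outcome range, rather than a random subset, is what forces the model outside its domain of applicability and makes this split strictly harder than the interpolation split.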
Figure 4

Results of extrapolation experiments.

We compared DPDI with two existing methods for preference-based data integration: the Gaussian-process-based approach by Sun et al.[11] and the linear support vector machine (SVM)-based approach (rankSVM) by Matsumoto et al.[4] Due to the high computational cost of the Gaussian process model, we conducted a scaled-down interpolation experiment integrating ChEMBL3885775 with the five other data sets listed in Table 1. Figure 5a shows the accuracy of the three methods. The accuracy of DPDI was the highest, indicating the superior modeling ability of deep neural networks. The computational time for training from preference data is summarized in Figure 5b. RankSVM is a linear model and the most efficient to train. As in most deep learning models, the training time of DPDI grows linearly with the number of preferences. The Gaussian process was particularly slow, showing superlinear growth. Among the three methods, DPDI achieved high standards in both accuracy and scalability.
Figure 5

(a) Accuracy of Gaussian process, DPDI, and rankSVM in interpolation experiments for ChEMBL3885775. (b) Computational time of Gaussian process, DPDI, and rankSVM.

We presented a deep learning approach, DPDI, for integrating multiple bioassay data sets. The significance of our method is that public data sets that would otherwise go unused can be turned to our advantage. DPDI converts each data set to a set of preferences; a favorable point of using preferences is that bioassay data sets can be used without any preprocessing. To derive clinically useful information from bioactivity data, researchers search for molecular substructures related to bioactivity by statistical analysis.[17] By integrating multiple data sets into surrogate values, the number of samples available for statistical analysis increases, leading to more conclusive results. A possible drawback of DPDI is that the user receives prediction results in the form of a ranking, not an exact outcome value. We anticipate that this does not affect scientists' decision making, because assay outcomes are always error-prone and small changes may not be critically important. To serve the community, we have made our PyTorch-based code publicly available at https://github.com/tsudalab/PrefIntNN.
References (15 in total; 10 shown):

1. Making every SAR point count: the development of Chemistry Connect for the large-scale integration of structure and bioactivity data.
Authors: Sorel Muresan; Plamen Petrov; Christopher Southan; Magnus J Kjellberg; Thierry Kogej; Christian Tyrchan; Peter Varkonyi; Paul Hongxing Xie
Journal: Drug Discov Today       Date: 2011-10-14

2. Inhibition of factor Xa: a potential target for the development of new anticoagulants.
Authors: John H Alexander; Kanwar P Singh
Journal: Am J Cardiovasc Drugs       Date: 2005

3. QSAR Modeling Using Large-Scale Databases: Case Study for HIV-1 Reverse Transcriptase Inhibitors.
Authors: Olga A Tarasova; Aleksandra F Urusova; Dmitry A Filimonov; Marc C Nicklaus; Alexey V Zakharov; Vladimir V Poroikov
Journal: J Chem Inf Model       Date: 2015-06-29

4. Kinetics for Drug Discovery: an industry-driven effort to target drug residence time.
Authors: Doris A Schuetz; Wilhelmus Egbertus Arnout de Witte; Yin Cheong Wong; Bernhard Knasmueller; Lars Richter; Daria B Kokh; S Kashif Sadiq; Reggie Bosma; Indira Nederpelt; Laura H Heitman; Elena Segala; Marta Amaral; Dong Guo; Dorothee Andres; Victoria Georgi; Leigh A Stoddart; Steve Hill; Robert M Cooke; Chris De Graaf; Rob Leurs; Matthias Frech; Rebecca C Wade; Elizabeth Cunera Maria de Lange; Adriaan P IJzerman; Anke Müller-Fahrnow; Gerhard F Ecker
Journal: Drug Discov Today       Date: 2017-04-13

5. Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition.
Authors: Sabrina Jaeger; Simone Fulle; Samo Turk
Journal: J Chem Inf Model       Date: 2018-01-10

6. Design, synthesis, and SAR of monobenzamidines and aminoisoquinolines as factor Xa inhibitors.
Authors: Penglie Zhang; Jingmei F Zuckett; John Woolfrey; Katherine Tran; Brian Huang; Paul Wong; Uma Sinha; Gary Park; Andrea Reed; John Malinowski; Stan Hollenbach; Robert M Scarborough; Bing-Yan Zhu
Journal: Bioorg Med Chem Lett       Date: 2002-06-17

7. One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome.
Authors: Alice Capecchi; Daniel Probst; Jean-Louis Reymond
Journal: J Cheminform       Date: 2020-06-12

8. Predictive QSAR modeling of phosphodiesterase 4 inhibitors.
Authors: Vasyl Kovalishyn; Vsevolod Tanchuk; Larisa Charochkina; Ivan Semenuta; Volodymyr Prokopenko
Journal: J Mol Graph Model       Date: 2011-10-14

9. N,N-Dialkylated 4-(4-arylsulfonylpiperazine-1-carbonyl)-benzamidines and 4-((4-arylsulfonyl)-2-oxo-piperazin-1-ylmethyl)-benzamidines as potent factor Xa inhibitors.
Authors: Zhaozhong J Jia; Ting Su; Jingmei F Zuckett; Yanhong Wu; Erick A Goldman; Wenhao Li; Penglie Zhang; Lane A Clizbe; Yonghong Song; Shawn M Bauer; Wenrong Huang; John Woolfrey; Uma Sinha; Ann E Arfsten; Athiwat Hutchaleelaha; Stanley J Hollenbach; Joseph L Lambing; Robert M Scarborough; Bing-Yan Zhu
Journal: Bioorg Med Chem Lett       Date: 2004-05-03

10. Discovery of betrixaban (PRT054021), N-(5-chloropyridin-2-yl)-2-(4-(N,N-dimethylcarbamimidoyl)benzamido)-5-methoxybenzamide, a highly potent, selective, and orally efficacious factor Xa inhibitor.
Authors: Penglie Zhang; Wenrong Huang; Lingyan Wang; Liang Bao; Zhaozhong J Jia; Shawn M Bauer; Erick A Goldman; Gary D Probst; Yonghong Song; Ting Su; Jingmei Fan; Yanhong Wu; Wenhao Li; John Woolfrey; Uma Sinha; Paul W Wong; Susan T Edwards; Ann E Arfsten; Lane A Clizbe; James Kanter; Anjali Pandey; Gary Park; Athiwat Hutchaleelaha; Joseph L Lambing; Stanley J Hollenbach; Robert M Scarborough; Bing-Yan Zhu
Journal: Bioorg Med Chem Lett       Date: 2009-03-03

