
Improving Compound Activity Classification via Deep Transfer and Representation Learning.

Vishal Dey^1, Raghu Machiraju^1,2,3, Xia Ning^1,2,3.

Abstract

Recent advances in molecular machine learning, especially deep neural networks such as graph neural networks (GNNs), for predicting structure-activity relationships (SAR) have shown tremendous potential in computer-aided drug discovery. However, the applicability of such deep neural networks is limited by the requirement of large amounts of training data. In order to cope with limited training data for a target task, transfer learning for SAR modeling has been recently adopted to leverage information from data of related tasks. In this work, in contrast to the popular parameter-based transfer learning such as pretraining, we develop novel deep transfer learning methods TAc and TAc-fc to leverage source domain data and transfer useful information to the target domain. TAc learns to generate effective molecular features that can generalize well from one domain to another and increase the classification performance in the target domain. Additionally, TAc-fc extends TAc by incorporating novel components to selectively learn feature-wise and compound-wise transferability. We used the bioassay screening data from PubChem and identified 120 pairs of bioassays such that the active compounds in each pair are more similar to each other compared to their inactive compounds. Overall, TAc achieves the best performance with an average ROC-AUC of 0.801; it significantly improves the ROC-AUC of 83% of target tasks with an average task-wise performance improvement of 7.102%, compared to the best baseline dmpna. Our experiments clearly demonstrate that TAc achieves significant improvement over all baselines across a large number of target tasks. Furthermore, although TAc-fc achieves slightly worse ROC-AUC on average compared to TAc (0.798 vs 0.801), TAc-fc still achieves the best performance on more tasks in terms of PR-AUC and F1 compared to other methods. 
In summary, TAc-fc is also found to be a strong model with competitive or even better performance than TAc on a notable number of target tasks.

Entities: Chemical

Year: 2022 PMID: 35350358 PMCID: PMC8945064 DOI: 10.1021/acsomega.1c06805

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

Drug discovery is a time-consuming and expensive process:[1] it takes at least 10 years and at least $1 billion to fully develop a drug.[2] During the initial stages of this process, promising drug candidates are identified by screening a large library of chemical compounds and then further investigated for specific properties. In order to speed up this process, computational approaches[3,4] have been adopted, particularly for identifying potential drug candidates during the initial stages of drug discovery. Computational approaches explore a much larger space of chemical compounds to predict their physicochemical properties and/or biological activities toward the target. In this paper, we consider the problem of compound bioactivity classification, where a compound is classified as active or inactive based on whether that compound binds to the protein target. Biological activities of compounds are initially examined in a bioassay by measuring their binding affinities or dissociation constants toward the target. Significant research[5−7] has established the relationship between the chemical structures and biological activities of compounds, also known as structure–activity relationships (SARs).[5] Several computational approaches[8] have been developed to model SARs and to predict compound bioactivities from their 2D/3D structures. However, most popular approaches such as deep neural networks require large amounts of labeled data for effective SAR modeling. Thus, the limited availability of bioassay data for specific targets still poses a major challenge in effective SAR modeling.[9] Over the years, several methods[10−12] aimed to improve SAR predictions for specific targets by leveraging activity information from related targets. These methods consider targets to be related, based on the principles from chemogenomics.[13−16] The key principle behind these methods is that similar proteins tend to bind to structurally similar compounds.
In this work, we consider proteins belonging to the same protein family to be similar. Thus, leveraging compound activity information collectively from bioassays corresponding to a set of proteins from the same protein family (e.g., G-protein-coupled receptors, kinases, peptidases, etc.) might better inform the SAR model than the individual bioassays. In essence, transfer learning can enable better SAR modeling by leveraging information from such related bioassays. However, existing methods are instance-based transfer learning methods.[17] They select a subset of data from related bioassays and then augment the training data for the target task with the selected subset. Existing deep transfer learning-based methods[18] for SAR modeling are either parameter-based (such as fine-tuning) or feature-based, out of which parameter-based methods are more popular. However, such methods can lead to overfitting and negative transfer,[17] especially when the targets are not related. In this regard, we believe that feature-based methods are better in that they can learn the similarity/relatedness between the targets in the latent space in a data-driven manner. Primarily, we develop an instance-based transfer learning method TAc that leverages target information from related bioassays, based on the key principle of chemogenomics as mentioned earlier. We further extend TAc to a novel feature-based deep transfer learning method, TAc-fc, that quantitatively measures transferability and explicitly learns what to transfer in a fully data-driven manner. To this end, we develop novel components to learn feature-wise and compound-wise transferability in order to effectively encode the commonalities among compounds of different tasks. In order to represent compounds, we leveraged the popular idea of a directed message passing neural network (dmpn)[19] and added an attention-based pooling mechanism; the resulting method is denoted as dmpna.
We collected a set of confirmatory bioassays from PubChem[20] that have a single protein target and are tested on chemical substances. We identified 120 bioassay pairs involving 59 protein targets such that the active compounds in each pair are more similar to each other compared to the inactive compounds. We compared our methods TAc and TAc-fc with several baselines with respect to two aspects: compound representation and transfer mechanisms. Overall, TAc-dmpna achieves the best performance compared to all other methods. Compared to TAc-dmpna, TAc-fc-dmpna performs slightly worse, but the latter still provides significant performance improvement on some target tasks. This suggests that although the transfer mechanism in TAc performs the best overall, the deep transfer mechanism with learned feature-wise and compound-wise transferability can actually benefit some targets. Furthermore, experimental results demonstrate the efficacy of our proposed attention mechanism of dmpna in learning better compound features. We provide additional experiments on the compound prioritization problem[12] where dmpna clearly outperforms all other compound representation methods. The rest of the paper is organized as follows. The Related Work section presents the related works in drug discovery and transfer learning with applications in SAR predictions. The Results and Discussion section presents the materials used for experimental evaluation, experimental results, and detailed analyses with discussions. These are followed by the conclusions and, finally, by the notations and definitions used in this paper and the proposed methods of transfer learning for activity prediction.

Related Work

In this section, we provide a brief overview of existing works, divided across three subsections. We first summarize notable works on computational approaches in drug discovery, then give a brief overview of existing works in transfer learning, and finally review existing methods that use transfer learning for better SAR modeling.

Computational Methods in Drug Discovery

The first step in the drug discovery process is to conduct bioassays[21] that screen a large set of compounds for desirable properties (e.g., activity, solubility, and toxicity). The findings from these bioassays guide the later steps of the drug discovery process. In order to speed up initial stages of the drug discovery process, computational approaches have been adopted. Computational approaches to predict activities/properties of compounds from their molecular structures have been a significant research area in cheminformatics.[8,22,23] These approaches rely on the quantitative structure–activity/property relationship (QSAR/QSPR)[5,24] to predict compound activities/properties as expressed in bioassays. In order to predict such activities/properties, machine learning methods such as classification and regression are typically used. Binary/real-valued observations from bioassay data are used to train these classification/regression methods. Popular conventional classification and regression methods to predict compound activities/properties consist of support vector machines,[25−28] random forests,[29,30] Bayesian models,[31,32] etc. In these methods, compounds are typically represented by hand-crafted molecular fingerprints[33,34] or descriptors.[35] Recently, deep learning methods[36−40] have demonstrated significant performance improvement over conventional methods across several activity/property prediction tasks.[41−43] Unlike conventional methods, these methods do not require careful and expensive design of hand-crafted molecular fingerprints or descriptors by domain experts. These methods learn the compound representations from molecular graphs[19,44−48] and SMILES strings,[49−51] in a fully data-driven manner for each task. Such learned representations are task-specific and can better encode relevant structures for each task. Thus, such learned representations are often more effective than molecular fingerprints or descriptors. 
While these deep learning models have achieved the state-of-the-art performance on several molecular activity/property prediction tasks, these models require a large amount of labeled training data to encode relevant patterns into learned representations. Training these models with limited labeled data for certain prediction tasks often leads to subpar performance.

Transfer Learning

In order to effectively train models with limited labeled data for certain prediction tasks, transfer learning between related tasks has been widely explored in Computer Vision (CV) and Natural Language Processing (NLP).[52,53] Transfer learning[17] is an emerging research area in which knowledge gained from auxiliary tasks is transferred to improve the predictive performance of the target task. Instead of training a model for the target task from scratch, a popular transfer learning technique, called fine-tuning,[54] fine-tunes the model pretrained from other related tasks. Pretraining does not explicitly learn what/when to transfer and rather relies on the model parameters to encode and transfer information across different tasks. Although pretraining is the most popular transfer learning method, it does not guarantee improvement (due to “negative transfer”[55]). Moreover, fine-tuning a highly parametrized model with limited data may lead to overfitting to the training data, and thus, the fine-tuned model might not generalize well to the test data. Apart from pretraining, another area of deep transfer learning, called domain adaptation, has gained a lot of attention.[56−58] Domain adaptation methods reduce the effect of the domain shift by learning domain-invariant representations that can generalize well across different tasks. In order to learn such representations, domain adaptation methods either minimize statistical measures[59−61] of domain shift or use adversarial training.[62] Following the success of adversarial training in generative adversarial networks (GANs),[63] adversarial domain adaptation methods[64−66] gained more attention and demonstrated state-of-the-art performances over benchmark CV and NLP data sets. Adversarial domain adaptation methods use adversarial training to learn domain-invariant representations via a minimax optimization using a feature extractor, a domain classifier, and a label predictor. 
The principle of adversarial training is used to train the feature extractor to learn domain-invariant representations which are indistinguishable by the domain classifier. Seminal methods in adversarial domain adaptation[64,65,67] differ in the design choices, such as adversarial loss functions, optimization, coupling of weights, etc. Other existing methods focus on conditional feature alignment,[68,69] multisource transfer,[70,71] etc. However, these methods have been specifically developed for image domain adaptation or image translation problems. To the best of our knowledge, none of these methods have been widely adapted for graph-structured data. In this work, following the idea of adversarial domain adaptation, we proposed a novel transfer learning method that learns effective compound representations from graph-structured data and transfers relevant information from a related task to the target task.

Transfer Learning in SAR Predictions

To alleviate the limited data problem in cheminformatics, various transfer learning[18,72−75] and multitask learning methods[76−80] have been recently developed. Inspired by the success of pretraining followed by fine-tuning in CV and NLP, Goh et al.[81] proposed ChemNet, where a deep neural network is pretrained on a large set of compounds in a self-supervised manner and then fine-tuned on individual activity/property prediction tasks. Following the same idea, Li and Fourches[82] proposed MolPMoFiT, which trains a long short-term memory (LSTM)[83] network on SMILES strings of compounds and then fine-tunes the pretrained model on specific tasks. Although pretraining has been widely studied, existing work in cheminformatics does not demonstrate significant performance improvement over the state-of-the-art supervised models in a single-task setting. Moreover, models trained on SMILES strings do not explicitly leverage the topological information of compounds. In contrast, our methods use molecular graphs as inputs and hence explicitly leverage the topological information. Adversarial transfer learning has rarely been explored for SAR predictions or on graph-structured data. To the best of our knowledge, only recently Abbasi et al.[84] combined multitask networks and adversarial domain adaptation to learn transferable molecular representations from multiple source bioassays to improve the prediction performance on the target bioassay. The authors evaluated their model on biophysics and physiology data sets such as Tox21, SIDER, BACE, ToxCast, and HIV. Experimental results demonstrated that the proposed method outperforms no-transfer methods only on a few target tasks. Moreover, experimental results do not clearly demonstrate the contribution of the adversarial domain adaptation component to the overall performance. Overall, prior work on transfer-learning-based SAR modeling does not clearly suggest a performance gain over conventional SAR models over a wide array of target tasks.

Results and Discussion

In this section, we present the materials used for experimental evaluation (the Materials subsection), followed by detailed experimental results and discussions (the subsequent subsections, through Section 2.7).

Materials

In this section, we describe the data set generation, baseline methods, and experimental protocols in detail.

Data Set Generation

We used the real screening data from PubChem to test our methods. PubChem[20,85] is one of the largest public chemical databases, with more than 271 M substances, 111 M unique chemical structures, and 293 M bioassay data points. We selected a set of bioassays from PubChem [accessed on 2020-12-25] such that each bioassay has a sufficiently large number of active and inactive compounds. Then, we generated pairs of bioassays for transfer learning in accordance with the protocols described in the subsections below (Initial Bioassay Selection and Pruning, and Transferable Bioassay Pairing, Section 2.1.1.2).

Initial Bioassay Selection and Pruning

We first selected a set of 7284 confirmatory bioassays that have a single protein target and are tested on chemical substances. These bioassays have 1279 unique protein targets in total. Among these protein targets, we were able to identify the organism and protein family information for 961 protein targets within 435 protein families using UniProt.[86] We further combined the 435 protein families into 278 families (e.g., the Peptidase A1, Peptidase C12, and Peptidase C13 families were combined into the peptidase family). Among the 278 families, we selected the 10 that have the most protein targets belonging to “Human” organisms. These top 10 protein families are the (1) G-protein-coupled receptor 1 family, (2) peptidase family, (3) protein kinase family, (4) nuclear hormone receptor family, (5) protein-tyrosine phosphatase family, (6) ABC transporter family, (7) cytochrome P450 family, (8) Bcl-2 family, (9) G-protein-coupled receptor 3 family, and (10) histone deacetylase family. These protein families involved 269 unique protein targets and covered the major drug targets in drug discovery.[87,88] Bioassays with targets from these 10 protein families were then processed as follows. We combined bioassays of the same target into one bioassay, resulting in 269 combined bioassays. For each combined bioassay, we selected the compounds that were tested for inhibition against the target (i.e., the corresponding PubChem activity type specified by the depositor was “inhibitor”). From those compounds, we selected the ones that were specified as either “active” or “inactive” against the target and discarded those that were specified as “inconclusive” or “undetermined”. If an active/inactive compound appeared multiple times in the bioassay with the same activity label, we retained one of its records.
If an active/inactive compound appeared multiple times in the bioassay with different activity labels, we removed the compound from the bioassay (in our data set, only about 2.08% of compounds for each bioassay on average appear multiple times with different activity labels). We used canonical SMILES strings to detect identical compounds. After the above processing, each combined bioassay has on average 17 005 unique compounds in total, with 188 active and 16 817 inactive. Furthermore, out of the 269 combined bioassays, 95 bioassays have more than 50 active compounds. Among the 10 protein families involved in the 95 bioassays, 2 protein families had only 1 target with more than 50 active compounds. Thus, we removed these 2 protein families and only used the remaining 8 protein families and their 93 bioassays. This set of 93 bioassays has on average 40 115 compounds, with 521 active and 39 595 inactive. This set of bioassays is used to create bioassay pairs, as described in the next section. Table S1 in the Supporting Information presents the statistics of each of the 93 bioassays.
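
The label filtering and deduplication rules above can be illustrated with a minimal sketch. It assumes each record is already a (canonical SMILES, activity label) pair; the function name `clean_bioassay` and the toy records are hypothetical and are not taken from the released code.

```python
from collections import defaultdict

def clean_bioassay(records):
    """Apply the filtering rules to (canonical_smiles, label) records.

    Assumes the SMILES strings are already canonical, so identical
    strings denote identical compounds.
    """
    labels_seen = defaultdict(set)
    for smiles, label in records:
        # Discard "inconclusive" and "undetermined" records.
        if label in ("active", "inactive"):
            labels_seen[smiles].add(label)
    # Keep one record per compound with a consistent label;
    # drop compounds that appear with conflicting labels.
    return {s: next(iter(ls)) for s, ls in labels_seen.items() if len(ls) == 1}

# Toy records (hypothetical data)
records = [
    ("CCO", "active"), ("CCO", "active"),                   # consistent duplicate
    ("c1ccccc1", "active"), ("c1ccccc1", "inactive"),       # conflicting labels
    ("CC(=O)O", "inactive"), ("CCN", "inconclusive"),
]
cleaned = clean_bioassay(records)
print(cleaned)
```

In this toy input, the duplicated compound is kept once, the compound with conflicting labels is removed, and the inconclusive record is discarded.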

Transferable Bioassay Pairing

From the 93 processed bioassays, we constructed 765 bioassay pairs such that in each pair the protein targets of the two bioassays are from the same protein family. We selected targets from the same protein family based on the key intuition of chemical genomics:[15,16] proteins from the same family tend to have similar binding pockets and bind to similar compounds. This is the physicochemical foundation that enables possible information transfer across protein targets, so such targets and their bioassays can be used to test transfer learning. We first ensured that each of the 765 pairs of bioassays had balanced active and inactive compounds as follows. In each pair of bioassays, we removed the compounds that appeared in both bioassays but with different activity labels (on average, 2.09% of all unique compounds in a pair of bioassays). This is to avoid any conflicting information across bioassays, which could adversely affect our transfer learning method. For compounds with the same activity labels in both bioassays (on average, 1.82% of all unique compounds in a pair of bioassays), we randomly sampled half of them into one of the bioassays and the other half into the other. This is to avoid duplication of compounds across bioassays, which could lead to overestimation of predictive performance. After the above steps, for each bioassay of a pair, we used all its active compounds and randomly sampled the same number of inactive compounds. If the inactive compounds were not sufficient, we randomly sampled compounds from PubChem that were not active in the bioassay as additional inactive compounds for the bioassay. This is to ensure that each bioassay in a pair has an equal number of active and inactive compounds, and thus the learning will not be dominated by either active or inactive compounds. Please note that a bioassay involved in two pairs may have different numbers of active and inactive compounds in each pair, depending on the bioassay it is paired with.
After the above steps, we selected the bioassay pairs such that each bioassay in each pair had at least 50 active compounds retained. There were 635 such pairs that involved 92 bioassays in total. Among the 635 pairs of bioassays, we further selected pairs such that the active compounds in each pair are more similar to each other than their inactive compounds, as follows:
1. For a pair of bioassays $B_i$ and $B_j$ and their respective active and inactive compounds, denoted as $C_i^+$, $C_i^-$, $C_j^+$, and $C_j^-$, respectively, we calculated the following two types of average compound similarities using the Tanimoto coefficient[89] over Morgan count fingerprints (with radius = 3 and dimension = 2048): (1) among compounds of the same labels across the two bioassays, $\mathrm{sim}(C_i^+, C_j^+)$ and $\mathrm{sim}(C_i^-, C_j^-)$, and (2) among compounds of different labels across the two bioassays, $\mathrm{sim}(C_i^+, C_j^-)$ and $\mathrm{sim}(C_i^-, C_j^+)$.
2. Based on the similarities, we selected a set of bioassay pairs, denoted as $\mathcal{P}_0$, such that in each pair the active compounds of the two bioassays are more similar to each other than their inactive compounds. We identified 329 such pairs. From $\mathcal{P}_0$, we further selected a set of bioassay pairs, denoted as $\mathcal{P}$, such that in each pair the active compounds in the two bioassays have a similarity above a certain threshold (0.026, the average value among all these pairs).
After the above process, we identified 120 pairs of bioassays in $\mathcal{P}$, involving 59 bioassays and 7 protein families with 278 active and 278 inactive compounds in each bioassay on average. Table S2 presents all the pairs and their compound statistics.
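
The similarity computation in step 1 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: count fingerprints are represented as plain `{feature_id: count}` dictionaries rather than RDKit objects, the Tanimoto coefficient for counts is taken in its min/max form, and the average is assumed to run over all cross-bioassay compound pairs. The toy fingerprints are hypothetical, and the final thresholding step is not reproduced here.

```python
def tanimoto_counts(fp_a, fp_b):
    """Tanimoto coefficient for count fingerprints: sum(min) / sum(max)."""
    keys = set(fp_a) | set(fp_b)
    num = sum(min(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    den = sum(max(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    return num / den if den else 0.0

def avg_similarity(set_a, set_b):
    """Average similarity over all pairs (a, b) with a in set_a, b in set_b."""
    pairs = [(a, b) for a in set_a for b in set_b]
    return sum(tanimoto_counts(a, b) for a, b in pairs) / len(pairs)

# Toy count fingerprints (hypothetical feature ids and counts)
act_i = [{1: 2, 2: 1}, {1: 1, 3: 1}]   # actives of the first bioassay
act_j = [{1: 2, 2: 1}]                 # actives of the second bioassay
inact_i = [{4: 1}]                     # inactives of the first bioassay
inact_j = [{5: 2, 1: 1}]               # inactives of the second bioassay

sim_act = avg_similarity(act_i, act_j)
sim_inact = avg_similarity(inact_i, inact_j)

# Keep the pair only if actives are more similar to each other than inactives.
keep_pair = sim_act > sim_inact
print(sim_act, sim_inact, keep_pair)
```

With real data, the dictionaries would be replaced by Morgan count fingerprints with radius 3 and 2048 dimensions, as stated above.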

Baseline Methods

We tested our TAc and TAc-fc methods with respect to two aspects: (1) compound representations and (2) transfer mechanisms. Compound representation is key to revealing information among compounds that can be transferred across bioassays. Transfer mechanisms are critical to enable effective transfer of the revealed information across bioassays.

Compound Representation Methods

Specifically, we compared our compound representation method dmpna (i.e., the feature learner of our methods) with the following compound representation methods. Binary Morgan fingerprint (morgan):[33] morgan uses a binary feature vector to represent a compound, in which each dimension of the feature vector corresponds to a predefined substructure, and the binary value in that dimension indicates whether the compound has that substructure. Morgan count fingerprints (morgan-c):[33] morgan-c is very similar to morgan except that the values in morgan-c represent how many of the corresponding substructures the compound has. Directed Message Passing Network (dmpn):[19] The dmpn method learns molecular structures by passing messages along directed edges over molecular graphs. It produces two representations for each bond by passing messages in the two directions along the bond. Then it learns atom representations from the incoming bond representations and generates a compound representation using mean pooling over the atom representations. The dmpn method (https://github.com/chemprop/chemprop) is the state-of-the-art compound embedding learning approach for compound property prediction. We generated morgan and morgan-c (with radius = 3 and size = 2048) using RDKit.[90] In order to compare only the different compound representation methods, not the transfer learning mechanisms, we used a two-layer fully connected network as the classifier S over the above baseline feature representations to predict activity labels. We used cross-entropy as the loss function for these baseline methods. The corresponding methods are denoted as FCN-morgan, FCN-morganc, and FCN-dmpn, respectively. Note that these three baseline methods do not have information transfer mechanisms; they are single-task compound prediction methods.
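
To make the difference between mean pooling (as in dmpn) and an attention-based pooling (as added in dmpna) concrete, the following is a minimal sketch. The scoring function used here, a dot product with a single vector `w` followed by a softmax, is an assumption chosen for brevity; the attention layer actually used in dmpna is defined in the methods section and may differ. All names and toy values are hypothetical.

```python
import math

def mean_pool(atom_vecs):
    """Compound representation as the mean of the atom representations."""
    n = len(atom_vecs)
    return [sum(col) / n for col in zip(*atom_vecs)]

def attention_pool(atom_vecs, w):
    """Weighted sum of atom representations with softmax attention weights.

    Scores are dot products with w (a stand-in for learned parameters).
    """
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in atom_vecs]
    m = max(scores)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(atom_vecs[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, atom_vecs)) for d in range(dim)]
    return pooled, weights

# Three atoms with 2-dimensional representations (toy values)
atoms = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

mean_vec = mean_pool(atoms)
# With w = 0, all atoms get equal weight, so attention pooling equals mean pooling.
uniform_vec, uniform_w = attention_pool(atoms, w=[0.0, 0.0])
# With a nonzero w, atoms with larger scores contribute more.
focused_vec, focused_w = attention_pool(atoms, w=[2.0, 0.0])
print(mean_vec, uniform_vec, focused_w)
```

The zero-vector case shows that mean pooling is the special case in which every atom receives the same attention weight.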

Learning Methods for Compound Prediction

We compared TAc and TAc-fc with a transfer learning baseline known as domain-adversarial neural network, denoted as DANN.[64] We selected DANN because, to the best of our knowledge, there are no existing transfer learning methods over graph-structured data, and DANN is a standard transfer learning baseline method used on other data (e.g., images).[58] In particular, we adapted DANN to learn compound features from graph-structured data via GNN (e.g., dmpn or dmpna). DANN consists of three components: (1) a feature extractor that represents compounds via feature learning; (2) a label predictor that predicts activity labels from learned compound features; and (3) a domain classifier that discriminates between the source and target compounds during training. DANN learns compound features that can generalize well from one domain to another, such that the learned features contain little discriminative domain information and enable DANN to accurately predict activity labels. The objective function in DANN consists of two losses: domain classification loss and label prediction loss. DANN uses a minimax optimization such that the domain classification loss is minimized with respect to the domain classifier and is maximized with respect to the feature extractor. Specifically, minimizing the domain classification loss will encourage the domain classifier to correctly discriminate between the source and target compounds. On the other hand, maximizing the domain classification loss will encourage the learning of generalizable compound features. The feature learner and discriminators in TAc-fc are learned via a minimax optimization, similar to how the feature extractor and the domain classifier in DANN are learned. However, TAc-fc is different from DANN in that TAc-fc learns feature-wise transferability and compound-wise transferability, while DANN only learns compound-wise transferability. 
Furthermore, following Ganin et al.,[64] DANN is trained on labeled data from the source domain and unlabeled data from the target domain.
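
The minimax optimization described above can be written compactly in the standard form used by Ganin et al.[64] The symbols below are introduced here for illustration and are not the notation of this paper: $\theta_f$, $\theta_y$, and $\theta_d$ denote the parameters of the feature extractor, label predictor, and domain classifier, $\mathcal{L}_y$ and $\mathcal{L}_d$ denote the label prediction and domain classification losses, and $\lambda > 0$ is a trade-off weight.

```latex
E(\theta_f, \theta_y, \theta_d)
  = \mathcal{L}_y(\theta_f, \theta_y) - \lambda \, \mathcal{L}_d(\theta_f, \theta_d)

(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d),
\qquad
\hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)
```

Maximizing $E$ over $\theta_d$ minimizes the domain classification loss with respect to the domain classifier, while minimizing $E$ over $\theta_f$ maximizes that same loss with respect to the feature extractor, which matches the description above.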

Experimental Protocols

Experimental Settings

In our experiments, we split the target bioassay of each pair into 10 folds. For the target bioassay, we used 1 fold for model training, 1 fold for validation, and the remaining 8 folds for testing. We performed the above process 10 times, with a different training fold each time, and reported the average performance over the test folds. The above 1:1:8 training/validation/testing ratio follows a typical setting in transfer learning,[91] where it is assumed that the training data are limited, so other tasks need to be leveraged via transfer. We used this cross-validation setting because we did not have a benchmark test set for each bioassay, and a 10-fold cross-validation will reduce the variance of the model performance. When we transferred the information from the source bioassay to the target bioassay, we used all the folds of the source bioassay and the training fold of the target bioassay in TAc in order to maximize the information content in the source bioassay that could be leveraged. If the baseline methods do not have an information transfer mechanism (e.g., FCN-morgan), we applied an additional setting to simulate information transfer: in addition to the target task’s own training compounds, we also used all the compounds from the source task as training data of the target task. Thus, the source compounds will enrich the training data and bring (i.e., transfer) information from Src directly to Tgt. This setting is referred to as data transfer, denoted as DT. If we only use the target task’s own training compounds for training, as in conventional single-task models, this setting is denoted as noT. We trained each model using an ADAM[92] optimizer with an initial learning rate of $10^{-3}$. All the models are trained for up to 40 epochs. We used a grid search to tune all the hyperparameters such as the dimension d of the compound embedding r, hidden-layer dimension of the attention layer for dmpna, hidden-layer dimension in L and G, and batch size.
We used the validation set to determine the optimal number of epochs. During training, we evaluated the performance of each model on the validation set at every epoch and chose the trained model at the epoch k that gives the best performance on the validation set; thus, we selected k as the optimal number of epochs. We used the ROC-AUC metric for the above performance evaluation. All evaluation metrics are discussed in the following section. All the hyperparameters are reported in Table S3 for reproducibility purposes.
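
The 1:1:8 protocol and the validation-based epoch selection can be sketched as below. This is a minimal illustration with assumptions: the text does not state which fold serves as the validation fold in each round, so the sketch uses the fold following the training fold; fold assignment by shuffling with a fixed seed is also an assumption, and the helper names are hypothetical.

```python
import random

def make_folds(items, n_folds=10, seed=0):
    """Shuffle the compounds and split them into n_folds disjoint folds."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::n_folds] for i in range(n_folds)]

def rotation_splits(folds):
    """Yield (train, val, test) with 1 training fold, 1 validation fold,
    and the remaining folds for testing.

    Assumption: the validation fold is the one after the training fold.
    """
    n = len(folds)
    for i in range(n):
        v = (i + 1) % n
        test = [x for k, fold in enumerate(folds) if k not in (i, v) for x in fold]
        yield folds[i], folds[v], test

def best_epoch(val_scores):
    """1-based epoch with the highest validation ROC-AUC (earliest on ties)."""
    return max(range(len(val_scores)), key=lambda k: val_scores[k]) + 1

folds = make_folds(range(100))           # 100 toy compound ids
splits = list(rotation_splits(folds))    # 10 rounds, one per training fold
k = best_epoch([0.70, 0.74, 0.78, 0.77]) # toy per-epoch validation ROC-AUC
print(len(splits), [len(part) for part in splits[0]], k)
```

Each round trains on 10% of the target compounds, which reflects the limited-data setting that motivates transfer from the source bioassay.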

Evaluation Metrics

We used the following evaluation metrics: area under the precision–recall curve (PR-AUC), area under the receiver operating characteristic curve (ROC-AUC), precision, sensitivity (sens), accuracy, and F1 score. Area under the precision–recall curve (PR-AUC): A precision–recall curve is generated by (precision, recall) value pairs corresponding to variable thresholds. PR-AUC measures the area under the precision–recall curve and provides an aggregate measure of performance across all possible thresholds. Area under the receiver operating characteristic curve (ROC-AUC): A receiver operating characteristic (ROC) curve is generated by plotting true positive rates against false positive rates at various threshold values. ROC-AUC measures the area under the ROC curve. precision: the ratio of correctly predicted positive instances out of all predicted positive instances (e.g., the ratio of predicted active compounds that are truly active). sens: the ratio of correctly predicted positive instances out of all ground-truth positive instances (e.g., the ratio of active compounds that are correctly predicted as active); this is the same quantity as recall. accuracy: the ratio of correctly predicted (positive and negative) instances out of all instances (e.g., the ratio of compounds that are correctly predicted as active/inactive). F1-score: the harmonic mean of precision and sens. For all the above metrics, higher values indicate better performance.
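
The threshold-based metrics and ROC-AUC defined above can be computed as in the following minimal sketch. It uses plain Python for self-containment and computes ROC-AUC through its rank interpretation (the probability that a randomly chosen active compound is scored above a randomly chosen inactive one, with ties counted as one half). PR-AUC is omitted. The 0.5 decision threshold and the toy labels and scores are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """precision, sens, accuracy, and F1 from binary labels (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sens = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * sens / (precision + sens) if precision + sens else 0.0
    return {"precision": precision, "sens": sens, "accuracy": accuracy, "f1": f1}

def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (active, inactive) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions (hypothetical values)
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Note that ROC-AUC depends only on the ranking of the scores, while precision, sens, accuracy, and F1 depend on the chosen decision threshold.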

Data and Software Availability

All the data sets and source code are publicly available at https://github.com/ninglab/TransferAct.

Overall Performance

Table 1 presents an overall performance comparison between TAc-dmpn, TAc-fc-dmpn, TAc-dmpna, TAc-fc-dmpna, and the baselines. The columns report the average and standard deviation over all bioassays for the respective evaluation metrics achieved by the optimal models. Note that for each bioassay the optimal model of each method is the model that gives the best ROC-AUC value, and thus the performance of each method in other metrics does not necessarily correspond to the optimal performance in those metrics.

Table 1

Overall Comparison^a

method	ROC-AUC	PR-AUC	precision	sens	accuracy	F1
FCN-morgan	0.727 ± 0.124	0.729 ± 0.121	0.648 ± 0.104	0.742 ± 0.131	0.661 ± 0.110	0.683 ± 0.105
FCN-morganc	0.731 ± 0.120	0.730 ± 0.118	0.653 ± 0.102	0.735 ± 0.132	0.664 ± 0.107	0.682 ± 0.105
FCN-dmpn	0.754 ± 0.101	0.733 ± 0.102	0.619 ± 0.116	0.739 ± 0.156	0.656 ± 0.087	0.655 ± 0.126
FCN-dmpna	0.755 ± 0.112	0.729 ± 0.112	0.660 ± 0.119	0.712 ± 0.165	0.665 ± 0.101	0.651 ± 0.136
FCN-dmpn (DT)	0.754 ± 0.104	0.735 ± 0.105	0.687 ± 0.106	0.686 ± 0.213	0.669 ± 0.088	0.655 ± 0.140
FCN-dmpna (DT)	0.763 ± 0.108	0.745 ± 0.109	0.702 ± 0.108	0.671 ± 0.213	0.672 ± 0.092	0.645 ± 0.148

DANN-dmpn	0.733 ± 0.103	0.715 ± 0.103	0.671 ± 0.110	0.647 ± 0.215	0.649 ± 0.084	0.623 ± 0.144
DANN-dmpna	0.734 ± 0.102	0.716 ± 0.104	0.676 ± 0.106	0.653 ± 0.226	0.651 ± 0.085	0.624 ± 0.154

TAc-dmpn	0.798 ± 0.103	0.785 ± 0.108	0.729 ± 0.095	0.729 ± 0.146	0.721 ± 0.093	0.714 ± 0.108
TAc-fc-dmpn	0.798 ± 0.102	0.784 ± 0.107	0.729 ± 0.094	0.731 ± 0.142	0.720 ± 0.091	0.715 ± 0.102
TAc-dmpna	0.801 ± 0.102	0.786 ± 0.107	0.731 ± 0.094	0.729 ± 0.143	0.720 ± 0.090	0.713 ± 0.103
TAc-fc-dmpna	0.798 ± 0.105	0.785 ± 0.109	0.730 ± 0.097	0.728 ± 0.147	0.719 ± 0.095	0.713 ± 0.109

In this table, the columns ROC-AUC, PR-AUC, precision, sens, accuracy, and F1 have the average and standard deviation over all bioassays in each performance metric. The best performance values are bold. The second best performance values are underlined.

Table 1 shows that, overall, TAc-dmpna achieves the best performance compared to all other methods. Specifically, TAc-dmpna achieves the best average ROC-AUC, PR-AUC, and precision scores of 0.801, 0.786, and 0.731, respectively. This demonstrates that TAc-dmpna can learn effective compound features for the target task by leveraging source bioassay data and can correctly classify the compounds of the target bioassay. Furthermore, all variants of TAc and TAc-fc, especially TAc-dmpn, TAc-fc-dmpn, and TAc-fc-dmpna, achieve similar performance on average across all metrics. The performance of these three methods is not significantly different in most metrics. This suggests that learning feature-wise and compound-wise transferability via TAc-fc methods does not necessarily provide a performance boost on average. However, compared to the best method (TAc-dmpna), TAc-fc-dmpn and TAc-fc-dmpna improve the ROC-AUC scores of 62% and 39% of target tasks, respectively. On the whole, all variants of TAc and TAc-fc significantly outperform all baselines. Specifically, TAc-dmpna improves the average ROC-AUC by 4.9–10.1% and significantly improves the ROC-AUC of at least 83% of the target tasks compared to any baseline method. Each of the other variants (TAc-dmpn, TAc-fc-dmpn, and TAc-fc-dmpna) improves the ROC-AUC of at least 79% of the target tasks compared to any baseline method. This indicates that these methods can effectively transfer relevant information from the source task to the target task. In particular, the transfer learning mechanism in all variants of TAc and TAc-fc can better leverage source domain data compared to the transfer learning mechanism in other baselines. 
This is because both TAc and TAc-fc variants can better control the transferable information by incorporating varying degrees of task relatedness between the source and target tasks during training. Additionally, TAc-fc variants can better extract relevant information from source domain data by learning feature-wise (Section ) and compound-wise transferability (Section ). The best performance among the baseline methods is achieved by FCN-dmpna (DT). Table 2 presents the performance comparison between TAc-dmpna and FCN-dmpna (DT). The diff % values in Table 2 are calculated as the percentage difference of average performance in each metric from TAc-dmpna over FCN-dmpna (DT), where the average performance in each metric is calculated as the performance in that metric averaged over all the bioassays. The t-diff % values are calculated as the average of task-wise performance improvement (in %) from TAc-dmpna over FCN-dmpna (DT); the numbers in parentheses below them are the corresponding p-values. The N-impv values denote the number of improved target tasks where TAc-dmpna performs better than FCN-dmpna (DT) in respective metrics. Considering only these N-impv improved tasks, the average of task-wise performance improvement (in %) is listed as t-impv % values. Similar to t-diff %, the numbers presented in parentheses below this row are the corresponding p-values for t-impv %. A p-value less than 0.05 was considered to be statistically significant.
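The four summary rows described above can be sketched as follows. This is a minimal illustration assuming that all percentages are taken relative to the baseline score and that a task counts as improved when its difference is strictly positive; the function name and the toy scores are hypothetical, and the significance test behind the reported p-values is not reproduced here.

```python
def compare_methods(new_scores, base_scores):
    """Summarize per-task scores of a new method against a baseline.

    Both lists hold one score per target task, in the same task order.
    """
    n = len(new_scores)
    mean_new = sum(new_scores) / n
    mean_base = sum(base_scores) / n
    # diff %: percentage difference of the two task-averaged scores
    diff_pct = 100.0 * (mean_new - mean_base) / mean_base
    # task-wise percentage improvement over the baseline
    task_pct = [100.0 * (a - b) / b for a, b in zip(new_scores, base_scores)]
    t_diff_pct = sum(task_pct) / n
    # restrict to the tasks that actually improved
    improved = [p for p in task_pct if p > 0]
    n_impv = len(improved)
    t_impv_pct = sum(improved) / n_impv if n_impv else 0.0
    return {
        "diff_pct": diff_pct,
        "t_diff_pct": t_diff_pct,
        "n_impv": n_impv,
        "n_impv_pct": 100.0 * n_impv / n,
        "t_impv_pct": t_impv_pct,
    }


# Hypothetical per-task ROC-AUC values for four tasks
summary = compare_methods([0.60, 0.88, 0.60, 0.72], [0.50, 0.80, 0.60, 0.80])
```

As a consistency check on this reading of diff %, the averaged ROC-AUC values in Table 2 (0.801 for TAc-dmpna and 0.763 for FCN-dmpna (DT)) give 100 × (0.801 − 0.763)/0.763 ≈ 4.980, which matches the reported value.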

Table 2

Performance Comparison of TAc-dmpna vs FCN-dmpna

method	ROC-AUC	PR-AUC	precision	sens	accuracy	F1
TAc-dmpna	0.801	0.786	0.731	0.729	0.720	0.713
FCN-dmpna (DT)	0.763	0.745	0.702	0.671	0.672	0.645
diff %	4.980	5.503	4.131	8.644	7.143	10.543
t-diff %	5.702	6.085	4.876	25.281	7.727	18.464
	(2.80 × 10^–19)	(8.00 × 10^–21)	(1.69 × 10^–11)	(1.73 × 10^–09)	(1.19 × 10^–29)	(8.93 × 10^–20)
N-impv	199 (83%)	192 (80%)	157 (65%)	153 (64%)	201 (84%)	198 (82%)
t-impv %	7.102	8.044	9.293	44.261	9.509	23.532
	(5.56 × 10^–22)	(5.51 × 10^–26)	(2.81 × 10^–27)	(7.36 × 10^–25)	(5.02 × 10^–35)	(3.60 × 10^–26)

In this table, the first two rows have the performance from respective methods averaged over all bioassays in each performance metric. The row diff % has the percentage difference of average performance in each metric from TAc-dmpna over FCN-dmpna (DT). The row t-diff % has the average of task-wise percentage improvement from TAc-dmpna over FCN-dmpna (DT) in respective metrics, with the corresponding p-value in parentheses below. The row N-impv has the number and percentage of target tasks where TAc-dmpna performs better than FCN-dmpna (DT) in respective metrics. The row t-impv % has the average of task-wise percentage improvement only among the corresponding improved tasks, with corresponding p-values in parentheses below.

Clearly, compared to the best baseline method FCN-dmpna (DT), TAc-dmpna improves the average ROC-AUC, PR-AUC, precision, sens, accuracy, and F1 scores by 4.980%, 5.503%, 4.131%, 8.644%, 7.143%, and 10.543%, respectively. Furthermore, the average task-wise performance difference (i.e., t-diff %) from TAc-dmpna over FCN-dmpna (DT) across each metric is 5.702%, 6.085%, 4.876%, 25.281%, 7.727%, and 18.464%, respectively. These differences are positive and statistically significant (as indicated by their corresponding p-values in parentheses), suggesting that the task-wise performance is significantly improved over FCN-dmpna (DT). In particular, TAc-dmpna significantly improves the ROC-AUC performance of 199 out of 240 (83%) target tasks with an average task-wise improvement (i.e., t-impv %) of 7.102% (p-value: 5.56 × 10^–22). Such consistent and significant improvement (demonstrated by t-impv % and their corresponding p-values) across all evaluation metrics on a large percentage of target tasks (demonstrated by N-impv) provides strong evidence that TAc-dmpna clearly outperforms FCN-dmpna (DT) on the majority of target tasks. This further implies that the transfer mechanism in TAc-dmpna is more effective than that in FCN-dmpna (DT). 
While FCN-dmpna (DT) pays equal attention to both the source and target tasks during training, TAc-dmpna can differentially focus on the two tasks by varying the weightage on the source classification loss (i.e., the trade-off parameter α in eq ). Note that TAc-dmpna with α = 1 is methodologically equivalent to FCN-dmpna (DT). By varying α, TAc-dmpna can incorporate different degrees of task relatedness between the source and target tasks during training. If the two tasks are not that related, a lower α will encourage the learning to focus more on the target task. In essence, the learned compound features are then more specific to the target task. On the other hand, α as high as 1 will enforce learning of compound features that generalize well across the two tasks. Such features may encode little target task-specific information and, hence, are not effective. Furthermore, our experimental results in Table 1 demonstrate the efficacy of our proposed attention mechanism of dmpna in learning better compound features. Overall, both dmpna-based methods (i.e., FCN-dmpna and FCN-dmpna (DT)) outperform dmpn-based methods (i.e., FCN-dmpn and FCN-dmpn (DT)). Particularly, compared to FCN-dmpn (DT), FCN-dmpna (DT) improves the ROC-AUC of 152 out of 240 (63%) target tasks and gives a significant performance improvement of 3.443% (p-value: 2.64 × 10^–18) on those improved target tasks. This demonstrates that the proposed attention mechanism in dmpna enables more effective compound features since it can differentially score atoms based on their relevance toward the final task. However, FCN-dmpna achieves either similar or slightly worse performance compared to FCN-dmpn. This is because dmpna with slightly more parameters than dmpn may struggle to capture relevant patterns during training, and thus FCN-dmpna can easily overfit to limited training data of the target task. In essence, this can lead to poor generalization performance on the test data. 
On the other hand, FCN-dmpna (DT) can generalize well since it is trained on the labeled source data along with the limited target data. Overall, the attention mechanism can better learn and effectively score the atoms in FCN-dmpna (DT) but not in FCN-dmpna, thereby achieving significant improvement in the former over FCN-dmpn (DT) and marginal improvement in the latter over FCN-dmpn. We will further demonstrate the efficacy of our proposed dmpna in the compound prioritization problem detailed in Section . Furthermore, all GNN-based baselines (i.e., FCN-dmpn, FCN-dmpna, FCN-dmpn (DT), and FCN-dmpna (DT)) significantly outperform DANN-based methods. Our experimental results show that both DANN-based methods yield poor or similar performance compared to all other baseline methods. Specifically, the best DANN method (i.e., DANN-dmpna) reduces the average performance by 3–4% over the best baseline method FCN-dmpna (DT) across all evaluation metrics. Such poor performance may be due to the ineffectiveness of domain-invariant compound features to encode necessary task-specific information. Surprisingly, DANN even performs worse than the fingerprint-based methods (i.e., FCN-morgan and FCN-morganc). As a matter of fact, overall, fingerprint-based methods perform relatively well compared to all other baselines. Compared to GNN-based methods, fingerprint-based methods achieve competitive or even better performance in most evaluation metrics. This could be due to potential overfitting of GNN-based methods in low-data settings. It is known that GNNs require large amounts of training data to extract relevant molecular substructures and to effectively encode meaningful task-specific information. In contrast, fingerprint-based methods are not data hungry owing to fewer learnable parameters, and thus, these methods can perform reasonably well in low-data settings.[93]

Top-N Task-Wise Performance Comparison

Table 3 presents a fine-grained performance comparison of top-performing methods over all 240 target tasks across different evaluation metrics. The columns corresponding to each evaluation metric have the percentage of tasks for which each method is among the top-N (N = 1, 3, 5) best methods with respect to the metric. Note that for each method we consider the best performing model that achieves the optimal performance in each evaluation metric. Therefore, for a given method, the models with the optimal performance in each metric do not necessarily have the same set of corresponding hyperparameters.
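The top-N percentages can be computed with a short routine such as the one below. This is a sketch under assumptions: the function name and the toy scores are hypothetical, higher scores are treated as better, and ties between methods are broken by the order in which the methods are listed, which the text does not specify.

```python
def top_n_percentages(scores, ns=(1, 3, 5)):
    """Percentage of tasks on which each method ranks within the top N.

    scores maps a method name to its list of per-task scores
    (same task order for every method, higher is better).
    """
    methods = list(scores)
    n_tasks = len(scores[methods[0]])
    counts = {m: {n: 0 for n in ns} for m in methods}
    for t in range(n_tasks):
        # sorted() is stable, so tied methods keep their listing order
        ranked = sorted(methods, key=lambda m: scores[m][t], reverse=True)
        for n in ns:
            for m in ranked[:n]:
                counts[m][n] += 1
    return {m: {n: 100.0 * counts[m][n] / n_tasks for n in ns} for m in methods}


# Hypothetical scores for three methods on three tasks
toy = {"A": [0.9, 0.5, 0.7], "B": [0.8, 0.6, 0.9], "C": [0.7, 0.4, 0.8]}
table = top_n_percentages(toy, ns=(1, 2))
```

With this tie-breaking rule, the top-1 percentages of all methods sum to 100 for each metric.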

Table 3

Top-N Performance Comparison (%)

method	ROC-AUC			PR-AUC			F1
top-N	1	3	5	1	3	5	1	3	5
FCN-morgan	10	15	20	13	21	30	11	19	23
FCN-morganc	2	13	18	3	17	22	4	17	22
FCN-dmpn	5	9	17	6	11	17	10	19	30
FCN-dmpna	4	10	26	2	7	20	5	11	26
FCN-dmpn (DT)	2	11	23	4	13	24	5	13	22
FCN-dmpna (DT)	5	18	33	6	14	31	2	8	23

DANN-dmpn	2	9	16	1	5	15	3	10	19
DANN-dmpna	2	8	17	1	8	15	2	11	19

TAc-dmpn	18	52	85	17	48	81	12	49	81
TAc-fc-dmpn	12	45	79	15	48	80	17	50	80
TAc-dmpna	22	62	89	14	55	83	13	43	78
TAc-fc-dmpna	16	51	81	20	56	86	15	50	81

In this table, the columns ROC-AUC, PR-AUC, and F1 have the percentage of tasks for which each method is ranked within the top-1, top-3, and top-5 best methods in respective metrics. The best performance values are in bold.

Table 3 shows that TAc methods achieve the top-1 best performance among more tasks compared to other methods. For example, TAc-dmpna is the best performing method in terms of ROC-AUC for 22% of tasks, that is, more than 2-fold compared to the best baseline method FCN-morgan (10%). TAc-dmpna consistently achieves the top-3 and top-5 best performance in terms of ROC-AUC on significantly more tasks compared to other methods, with even more folds of difference. Similar trends hold for PR-AUC and F1 as the evaluation metrics. This indicates the strong performance of TAc methods. Among the four TAc variants, TAc-dmpna is the best in terms of ROC-AUC; TAc-fc-dmpna is overall the best in terms of PR-AUC as it achieves the top-1, top-3, and top-5 best performance on more tasks compared to other methods; and TAc-fc-dmpn and TAc-fc-dmpna are the best in terms of F1 as they are either better than or similar to other methods. This indicates that, while different variants may have advantages when optimizing with respect to different evaluation metrics, TAc-fc (Figure 7, with the feature-wise and compound-wise discriminators) is also a very strong method and, on some metrics, even better than TAc.

Figure 7

Proposed architecture of TAc-fc. The feature learner F learns compound embedding r given the corresponding molecular graph. The feature-wise discriminator L learns feature-wise transferability given the learned compound embedding r. r is further scaled into z using its feature entropy from p out of L. The compound-wise discriminator G learns the compound-wise transferability given z. The domain-wise classifier S classifies the compound as active/inactive.

Comparison of Discriminators

Table 4 presents a detailed performance comparison between TAc, TAc-fc, TAc-c, and TAc-f, all with dmpna. We use dmpna here because as in Table TAc-fc-dmpna shows better performance on average compared to TAc-fc-dmpn. TAc-c and TAc-f are obtained by removing either the feature-wise discriminator L (Section ) or the compound-wise discriminator G (Section ) from TAc-fc, respectively. Note that for each bioassay the optimal model of each method is selected based on ROC-AUC. The diff % values in each row block are calculated as the difference (in %) of average performance in each metric from the TAc-fc variant over TAc. The t-diff % values in each row block are calculated as the average of task-wise performance improvements (in %) from the corresponding variant over TAc. The row N-impv in each row block denotes the number of improved target tasks where the variant performs better than TAc in respective metrics, and the average of task-wise performance improvement among only the improved tasks is calculated as t-impv % (in %).

Table 4

Comparison on Discriminators (with dmpna)

method	ROC-AUC	PR-AUC	precision	sens	accuracy	F1
TAc	0.801	0.786	0.731	0.729	0.720	0.713

TAc-fc	0.798	0.785	0.730	0.728	0.719	0.713
diff %	–0.375	–0.127	–0.137	–0.137	–0.139	0.000
t-diff %	–0.380	–0.119	–0.072	0.116	–0.150	–0.064
	(2.59 × 10^–04)	(4.88 × 10^–01)	(7.26 × 10^–01)	(7.00 × 10^–01)	(5.53 × 10^–01)	(8.04 × 10^–01)
N-impv	93 (39%)	125 (52%)	119 (50%)	111 (46%)	99 (41%)	112 (47%)
t-impv %	0.921	1.373	2.756	7.037	2.230	3.729
	(7.73 × 10^–18)	(4.13 × 10^–23)	(1.69 × 10^–18)	(5.85 × 10^–19)	(4.26 × 10^–14)	(1.31 × 10^–15)

TAc-c	0.801	0.786	0.730	0.734	0.721	0.716
diff %	0.000	0.000	–0.137	0.686	0.139	0.421
t-diff %	0.010	–0.080	–0.130	1.583	0.128	0.516
	(7.78 × 10^–01)	(6.72 × 10^–01)	(6.32 × 10^–01)	(3.03 × 10^–01)	(4.01 × 10^–01)	(3.90 × 10^–01)
N-impv	135 (56%)	119 (50%)	123 (51%)	126 (52%)	128 (53%)	130 (54%)
t-impv %	0.845	1.330	2.763	8.798	1.971	4.165
	(1.13 × 10^–23)	(5.03 × 10^–24)	(2.67 × 10^–17)	(8.22 × 10^–18)	(4.75 × 10^–21)	(1.75 × 10^–15)

TAc-f	0.799	0.785	0.732	0.722	0.721	0.711
diff %	–0.250	–0.127	0.137	–0.960	0.139	–0.281
t-diff %	–0.192	–0.091	0.218	–0.768	0.170	–0.326
	(4.00 × 10^–02)	(3.76 × 10^–01)	(5.44 × 10^–01)	(1.42 × 10^–01)	(3.19 × 10^–01)	(4.98 × 10^–01)
N-impv	100 (42%)	114 (48%)	124 (52%)	117 (49%)	125 (52%)	123 (51%)
t-impv %	1.029	1.597	2.992	6.999	2.037	3.764
	(3.22 × 10^–13)	(2.22 × 10^–21)	(3.85 × 10^–24)	(1.11 × 10^–19)	(9.65 × 10^–20)	(1.12 × 10^–16)

In this table, the first row block has the average performance of TAc. Each of the other row blocks has the performance comparison of a TAc-fc variant with respect to TAc. The metric diff % represents the difference of average performance of each comparison method with respect to TAc; t-diff % represents the average of the task-wise improvement, with corresponding p-values in the parentheses below; N-impv represents the number of improved tasks and its proportion in the parentheses; and t-impv % represents the average of the task-wise improvement only among the improved tasks, with corresponding p-values in the parentheses below.

Compared to TAc, TAc-fc achieves similar but slightly worse performance overall (i.e., −0.375% in diff % on ROC-AUC); on individual tasks, TAc-fc performs statistically significantly worse in terms of ROC-AUC but similarly to TAc on the other evaluation metrics. In addition, TAc-fc still provides significant task-wise improvement in about 40–50% of tasks across all evaluation metrics. Particularly, it improves the ROC-AUC score for 93 out of 240 (39%) tasks significantly by 0.921% on average (p-value = 7.73 × 10^–18) over TAc. This suggests that the learned feature-wise and compound-wise transferability together have the capacity of improving some target tasks. TAc-c performs similarly to TAc on average (i.e., 0.000 in diff % on ROC-AUC; no significant t-diff %). However, TAc-c improves about half or more of the tasks with statistical significance on all the evaluation metrics. For example, TAc-c achieves better ROC-AUC on 135 out of 240 (56%) tasks. This indicates that the global discriminator (Section ) that differentiates compounds for the source and target tasks could help improve performance for some tasks. 
TAc-f also shows improvement on about half of the tasks (N-impv) with significant improvement that is even higher compared to that in TAc-c but with overall performance (diff %) still slightly worse than that of TAc. The fact that TAc-fc, TAc-c, and TAc-f improve about half of the tasks over TAc without discriminators indicates that they are suitable for certain tasks. We hypothesize that TAc-c can effectively focus on similar compounds of source and target bioassays by learning compound-wise transferability via G. We validate this hypothesis with an additional analysis on model predictions and pairwise similarities of predicted compounds with source and target compounds. We find and study the active compounds that are correctly classified as active by TAc-c but incorrectly classified as inactive by TAc and its variants. Table S4 presents the analysis for these active compounds among target tasks which have at least one such active compound. For each of such active compounds in the target task, we calculated the mean pairwise similarities of that compound with its five most similar active compounds in the source task and in the target task, respectively. On average, TAc-c correctly classifies 5.4% (i.e., average of values in “cor %” column in Table S4) of active compounds that are incorrectly classified by TAc and its variants. These compounds were found to be 12.4% more similar to the active compounds in the source task than to those in the target task. Furthermore, in 47 out of 97 (48%) tasks with at least one active compound only correctly classified by TAc-c, the similarity difference is statistically significant (p-value < 0.05). Overall, this analysis demonstrates that TAc-c can better learn the commonalities between source and target compounds and hence can enhance information transfer from the source task to the target task.
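The similarity analysis above (mean pairwise similarity of a compound with its five most similar active compounds) can be sketched as follows. As a simplifying assumption, fingerprints are represented as Python sets of on-bit indices and compared with the binary Tanimoto coefficient; the function names and toy fingerprints are hypothetical, and the fingerprint type used for this particular analysis is not stated in this section.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def mean_top_k_similarity(query, references, k=5):
    """Mean similarity of `query` to its k most similar reference compounds."""
    sims = sorted((tanimoto(query, r) for r in references), reverse=True)
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0


# Hypothetical fingerprints: one target compound and the actives of each task
query = {1, 2, 3, 4}
source_actives = [{1, 2, 3, 4}, {1, 2}, {5, 6}]
target_actives = [{1, 9}, {7, 8}]
sim_source = mean_top_k_similarity(query, source_actives, k=2)
sim_target = mean_top_k_similarity(query, target_actives, k=2)
```

Comparing `sim_source` with `sim_target` for each compound mirrors the comparison between the source and target tasks described above.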

Parameter Study

Figure 1 presents the parameter study in TAc-dmpna on α (i.e., the trade-off parameter between the source and target classification losses as in eq ). The study was conducted over the tasks for which TAc-dmpna outperforms the other methods. The values in each cell in the figure represent the average of the best performance over the tasks with the optimal choice of other hyperparameters.
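To make the role of α concrete, the following minimal sketch combines a target and a source classification loss. The additive form (target loss plus α times source loss) and the use of binary cross-entropy are assumptions made for illustration; they are consistent with the description that α weighs the source classification loss and that α = 1 treats both tasks equally, but the exact objective is the one given in the referenced equation.

```python
import math


def bce(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over a batch of predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(y_true)


def combined_loss(target_batch, source_batch, alpha):
    """Assumed objective: target loss + alpha * source loss."""
    target_loss = bce(*target_batch)
    source_loss = bce(*source_batch)
    return target_loss + alpha * source_loss


# Hypothetical (labels, predicted probabilities) for each task
target = ([1, 0], [0.9, 0.1])
source = ([1, 0], [0.6, 0.4])
losses = {a: combined_loss(target, source, a) for a in (0.0, 0.1, 0.5, 1.0)}
```

Under this assumed form, α = 0 ignores the source task entirely and α = 1 weighs the two tasks equally.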

Figure 1

Parameter study of TAc-dmpna. The columns represent different evaluation metrics. The values in each cell have the average of the best performance achieved with given α and optimal choice of other hyperparameters. Darker cells indicate better performance.

Figure 1 shows that TAc-dmpna has the best average performance in ROC-AUC, PR-AUC, precision, sens, accuracy, and F1 when α = 0.5, 0.5, 0.1, 0, 0.5, and 0.1, respectively. It indicates that weighing the source and target classification losses differently has notable effects on the overall performance. This figure also demonstrates several trends: (1) the best average performance is achieved with α = 0.1 and 0.5 (i.e., nonzero values) for all the metrics except sens and (2) performance degrades as α increases further. Nonzero values of α as the optimal values indicate that leveraging information from the source task is able to help improve the target task. The fact that the optimal, nonzero α values are relatively small indicates that the training is still more focused on the target tasks, while useful information is transferred from the source tasks. On the other hand, if α is too large (i.e., the source classification loss is given high weightage), the training would be dominated by the source task, and thus the trained model could not well capture the patterns in the target task. That could explain why model performance decreases when α increases.

Figure 2 presents the parameter study in TAc-fc-dmpna in terms of ROC-AUC on α (i.e., the trade-off parameter between source and target losses in eq ) and λ (i.e., the trade-off parameter between the classification and discriminator losses in eq ). Studies over other metrics are presented in Figure S1 in the Supporting Information. The values in each cell of this figure represent the average of the best performance over the tasks where TAc-fc-dmpna outperforms all other methods, with corresponding α and λ and with optimal choice of other hyperparameters.

Figure 2

Parameter Study of TAc-fc-dmpna in terms of ROC-AUC.

Figure 2 shows that TAc-fc-dmpna has the best performance in ROC-AUC (i.e., 0.769) when α = 0.5 and λ = 0.01 and 0.001, that is, all nonzero values. This demonstrates that a lower weight on the source classification loss than the target classification loss and a lower weight on the discriminator losses (the sum of the feature-wise and compound-wise discriminator losses) will enable effective transfer of relevant information from the source domain. Figure 2 also demonstrates that when α is too small or too large, regardless of what λ is, there is a significant performance drop (as indicated in the topmost rows). This effect of α can be explained following the same reasoning presented for TAc-dmpna above. For the optimal α in each metric, λ = 0.01 gives the best performance for most metrics. This implies that TAc-fc-dmpna can effectively leverage source task data to learn transferable compound features (using L) and to selectively focus on similar compounds (using G) during training. Intuitively, for a given α, higher λ values (i.e., higher weight on discriminator losses) will encourage learning of more domain-invariant compound features. Such domain-invariant features contain little task-specific information and may not be relevant for effective activity classification for the target task, and therefore the overall performance degrades.

Case Studies: TAc-dmpna

Relation between Performance Improvement and Bioassay Similarity

Among 240 tasks, we identified and studied four tasks with a significant performance difference in ROC-AUC from TAc-dmpna over the best no-transfer baseline method (i.e., FCN-dmpna). Figure 3 presents the average pairwise similarity matrices of the four task pairs (captions include the corresponding bioassay PubChem AIDs of the source bioassay and the target bioassay), where Figure 3a and 3b have the target tasks that are significantly improved by TAc-dmpna and Figure 3c and 3d have the target tasks that are significantly degraded. In the figure, the labels denote the active (+) and inactive (−) compounds for the source (BS) and target (BT) tasks, and average compound similarities (sim) were calculated using the Tanimoto coefficient over Morgan-count fingerprints (with radius = 3 and dimension = 2048).

Figure 3

Similarity matrices of target pairs with significant ROC-AUC improvement/degradation.

In Figure 3a and 3b, the performance of the target tasks NP_005152 and NP_036559 was improved from TAc-dmpna over FCN-dmpna by 34.13% and 27.10%, respectively. Figure 3a and 3b show that, for these two target tasks, the similarity between active compounds across the two bioassays (0.152 in Figure 3a, 0.164 in Figure 3b) is notably greater than both similarities between compounds with different activity labels across the bioassays (0.134 in Figure 3a, 0.125 in Figure 3b; and 0.137 in Figure 3a, 0.136 in Figure 3b). This indicates that if active compounds across bioassays are more similar than compounds with different activity labels across bioassays, TAc-dmpna can better capture the commonalities among those similar active compounds and can better transfer relevant information across bioassays. This transferred information can effectively improve the target task performance. On the other hand, if compounds with different activity labels across bioassays are more similar than compounds with the same activity labels, TAc-dmpna can cause transfer of conflicting information. Such a transfer can result in performance degradation for the target task. The performance degradation in ROC-AUC from TAc-dmpna over FCN-dmpna was 5.74% for the target task in the pair (AAI28575, NP_066285) in Figure 3c and 2.34% for the target task in the pair (AAB26273, NP_003605) in Figure 3d. In Figure 3c and 3d, the similarity between active compounds across the bioassays and the similarity between compounds with different activity labels are relatively close (0.127 vs 0.122 in Figure 3c, 0.116 vs 0.114 in Figure 3d). This indicates that when the similarities between active compounds of the source bioassay and inactive compounds of the target bioassay are relatively high, TAc-dmpna could lead to transfer of conflicting information, causing inactive compounds in the target bioassay to be incorrectly classified as active. Furthermore, we analyzed the relation between the task-wise ROC-AUC improvement from TAc over FCN-dmpna and the bioassay similarities. Figure 4 presents such a relation. Note that the bioassay similarities are calculated as the average of all pair-wise compound similarities across two bioassays in the same way as discussed in Section . 
Figure 4 demonstrates that there are significant task-wise ROC-AUC improvements (e.g., in the upper right region) when the pair-wise similarities are relatively high (e.g., greater than 0.12), and there are marginal improvements (e.g., in the lower left region) when the pair-wise similarities are low (e.g., lower than 0.12). This suggests that if bioassay pairs are more similar, TAc can improve the performance over FCN-dmpna by a large margin (e.g., more than 10%). On the other hand, if bioassay pairs are less similar, TAc achieves little or no improvement over FCN-dmpna. However, there are some bioassay pairs that are more similar, yet TAc achieves marginal or negative improvement (e.g., in the lower middle region). This is possibly because the performance improvement is not only a function of bioassay similarity; the improvement can also be marginal or negative owing to poor generalization during testing.

Figure 4

ROC-AUC improvement from TAc over FCN-dmpna vs bioassay similarity.

Correctly Classified Compounds Possibly Due to More Similar Compounds in the Source Bioassay

In this section, we identified (i) a few compounds that were correctly classified by TAc but incorrectly classified by the baselines and (ii) a few compounds that were incorrectly classified by TAc but correctly classified by the baselines. Figure 5(a) and (b) present two such examples for (i), and Figure 5(c) and (d) present two such examples for (ii). In each panel, the left-most compound is the compound from the target bioassay to be classified as active/inactive (referred to as the target compound), and the others are its top 5 most similar compounds from the corresponding source bioassay (referred to as the source compounds). The mean pairwise Tanimoto coefficients between the target compound and its source compounds in Figure 5(a), (b), (c), and (d) are 0.407, 0.428, 0.210, and 0.143, respectively. Thus, in Figure 5(a) and (b), the target compounds are structurally more similar to their corresponding source compounds, whereas in Figure 5(c) and (d), the target compounds are less similar to their corresponding source compounds. This suggests that TAc classifies some compounds correctly, probably because those compounds have very similar compounds in the source bioassay.

Figure 5

Visualization of a few selected compounds from the target bioassay and their corresponding top 5 most similar compounds from the source bioassay.

Compound Prioritization Using dmpna

We also explored the potential of using dmpna for compound prioritization purposes. We developed a comprehensive learning-to-rank method gnnCP for effective compound prioritization that jointly learns molecular graph representations via GNN and a scoring function using the representations. The learning methods for compound prioritization are described in Section in the Supporting Information.

Materials

Baselines

We compare gnnCP with the following feature vectors using the same scoring and loss functions: (i) binary Morgan fingerprints (morgan), (ii) Morgan-count fingerprints (morgan-c), (iii) bioassay-specific compound features[12] computed using the Tanimoto coefficient on binary Morgan fingerprints (morgan-ba), (iv) 200-dimensional RDKit descriptors (RDKit200), and (v) directed message passing network[19] (dmpn). We generate binary Morgan fingerprints and Morgan-count fingerprints (with radius = 2 and size = 2048) using RDKit.[90] Code for computing the RDKit descriptors is available in the Descriptastorus package.[94]

Experimental Protocol

In order to evaluate the overall ranking performance, we perform a 5-fold cross validation. We randomly split each bioassay into five folds. In each run, four folds of each bioassay are used for training, and the other fold is used for testing. We record optimal values of each performance metric averaged over the five folds. Finally, we report the average of all such recorded optimal values of each performance metric over all the bioassays. For each bioassay, we train the models using an Adam[92] optimizer with an initial learning rate ∈ {5 × 10^–3, 1 × 10^–3, 5 × 10^–4}. We use grid search to tune all the hyperparameters such as the dimension of the graph representation d, hidden dimension of the attention layer, and batch size. Specifically, we use d ∈ {25, 50, 100} for dmpn and dmpna and a hidden dimension of the attention layer ∈ {5, 10, 20} for dmpna. We use batch size ∈ {128, 256, 512} and λ = 1 × 10^–6 for all the models. All the models are trained for 50 epochs.
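The protocol above (random 5-fold split, grid search, and selection by the fold-averaged metric) can be sketched as follows. The grid values are those listed in the text; the function names, the dictionary keys, and the `evaluate` callback (which would train a model on the training indices and return a test-fold score) are hypothetical placeholders rather than parts of the released code.

```python
import itertools
import random


def k_fold_indices(n_items, n_folds=5, seed=0):
    """Randomly partition item indices into n_folds disjoint folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


# Hyperparameter grid from the text (key names are illustrative)
GRID = {
    "learning_rate": [5e-3, 1e-3, 5e-4],
    "d": [25, 50, 100],
    "attention_hidden": [5, 10, 20],
    "batch_size": [128, 256, 512],
}


def grid_search(n_items, evaluate, grid=GRID, n_folds=5):
    """Return the best fold-averaged score and its configuration.

    evaluate(cfg, train_idx, test_idx) is a user-supplied (hypothetical)
    callback that trains on train_idx and scores on test_idx.
    """
    folds = k_fold_indices(n_items, n_folds)
    keys = list(grid)
    best_score, best_cfg = float("-inf"), None
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        fold_scores = []
        for i, test_idx in enumerate(folds):
            # the remaining four folds form the training set
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            fold_scores.append(evaluate(cfg, train_idx, test_idx))
        mean_score = sum(fold_scores) / n_folds
        if mean_score > best_score:
            best_score, best_cfg = mean_score, cfg
    return best_score, best_cfg
```

With the full grid this evaluates 81 configurations per bioassay, each over five folds.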

Evaluation

We evaluate all the methods using a set of 105 single-target confirmatory bioassays from PubChem.[20] These bioassays all use IC50 to measure compound binding affinities and have at least 50 active compounds. For each bioassay, we only keep the active compounds and remove duplicate compounds and those with identical IC50 values. We evaluate the ranking performance using concordance index (CI), recall@k (R@k), and normalized discounted cumulative gain@k (ndcg@k),[12] where k = 3, 5, and 10. We also use R@k% and ndcg@k%, where we consider the top k% (k = 5, 10) of the test fold compounds in r.

Table 5 presents the performance comparison between dmpna, dmpn, and the baselines. Overall, dmpna performs significantly better than all the baselines, including dmpn, across all performance metrics. The average performance improvement from dmpna over dmpn in terms of CI, recall@3, recall@5, ndcg@3, ndcg@5, recall@5%, and ndcg@5% is 2.353%, 6.608%, 4.460%, 3.114%, 2.421%, 18.429%, and 4.475%, respectively. Furthermore, compared to dmpn, the average bioassay-wise performance improvement from dmpna is most significant in terms of recall@3, recall@5, ndcg@3, ndcg@5, recall@5%, and ndcg@5% (p-values: 4.89 × 10^–10, 9.87 × 10^–13, 1.42 × 10^–12, 2.25 × 10^–12, 3.95 × 10^–15, and 1.83 × 10^–11, respectively). This indicates that dmpna can rank the topmost compounds better than dmpn. Unlike mean pooling in dmpn, the attention mechanism in dmpna can differentially focus on atoms based on the relevance of each atom to the prioritization problem. This demonstrates the ability of dmpna to better differentiate compounds and to achieve effective compound prioritization. Furthermore, dmpna and dmpn significantly outperform all the fingerprint-based baselines across all performance metrics. 
Compared with the best performing fingerprint-based baseline morgan-c, in terms of CI, recall@3, recall@5, ndcg@3, ndcg@5, recall@5%, and ndcg@5%, the average performance improvement from dmpna is 5.247%, 25.724%, 13.094%, 8.140%, 5.871%, 56.875%, and 10.637%, respectively, and from dmpn in terms of CI, recall@3, recall@5, ndcg@3, ndcg@5, recall@5%, and ndcg@5% is 2.827%, 17.932%, 8.266%, 4.874%, 3.369%, 32.464%, and 5.898%, respectively. This demonstrates that the learned representation out of gnnCP can effectively encode useful molecular substructure information and thus is more effective for compound prioritization.
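
For readers unfamiliar with the ranking metrics, the following sketch shows one common way to compute the concordance index and recall@k. It is not taken from the paper: the function names are hypothetical, scores are assumed to be oriented so that higher means more active (for example a negated or log-transformed IC50), and recall@k is assumed to mean the overlap between the true and predicted top-k sets. The exact definitions used in the experiments may differ.

```python
def concordance_index(true_scores, pred_scores):
    """Fraction of pairs with distinct true scores that the predictions
    order correctly. Tied predictions count as half-correct."""
    concordant, comparable = 0.0, 0
    n = len(true_scores)
    for i in range(n):
        for j in range(i + 1, n):
            if true_scores[i] == true_scores[j]:
                continue  # pair is not comparable
            comparable += 1
            dt = true_scores[i] - true_scores[j]
            dp = pred_scores[i] - pred_scores[j]
            if dp == 0:
                concordant += 0.5
            elif (dt > 0) == (dp > 0):
                concordant += 1.0
    return concordant / comparable if comparable else 0.0

def top_k(scores, k):
    """Indices of the k highest scores."""
    return set(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])

def recall_at_k(true_scores, pred_scores, k):
    """Share of the true top-k compounds found in the predicted top-k."""
    return len(top_k(true_scores, k) & top_k(pred_scores, k)) / k

# Toy example: five compounds, one swapped pair in the predicted order.
true = [5.0, 4.0, 3.0, 2.0, 1.0]
pred = [0.9, 0.5, 0.7, 0.2, 0.1]
ci = concordance_index(true, pred)   # 9 of 10 pairs are ordered correctly
r2 = recall_at_k(true, pred, 2)      # one of the two true top compounds is retrieved
```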

Table 5

Overall Performance Comparison of gnnCP

method	CI	R@3	R@5	ndcg@3	ndcg@5	R@5%	ndcg@5%
morgan	0.706	0.543	0.644	0.814	0.816	0.420	0.838
morgan-c	0.711	0.545	0.655	0.815	0.819	0.437	0.846
morgan-ba	0.687	0.500	0.626	0.789	0.797	0.375	0.816
RDKit200	0.687	0.519	0.632	0.790	0.797	0.396	0.813
dmpn	0.731	0.643	0.709	0.854	0.847	0.579	0.896
dmpna	0.748	0.686	0.740	0.881	0.867	0.686	0.936
diff %	2.353	6.608	4.460	3.114	2.421	18.428	4.475
t-diff %	2.535	7.645	4.720	3.406	2.569	24.578	4.979
p-value	1.14e-10	4.89e-10	9.87e-13	1.42e-12	2.25e-12	3.95e-15	1.83e-11

In this table, the columns have the respective average of each performance metric over all bioassays obtained by the respective optimal hyperparameter settings. The best/second best performance under each metric is bold/underlined.

Conclusions

We have developed TAc, which effectively leverages source bioassay data to improve the performance of the target task. We also proposed a variant of TAc, i.e., TAc-fc, that additionally learns feature-wise and compound-wise transferability. We conducted an exhaustive array of experiments and analyses that suggest that TAc-dmpna is the best-performing method on average across all target tasks. The proposed variant is also a very strong method and even outperforms TAc on certain target tasks. Furthermore, in ablation studies, we showed that TAc-fc-dmpna can improve performance for more than half of the target tasks compared to TAc-dmpna. Our analyses further demonstrated that learning compound-wise transferability via G can better encode the commonalities between compounds across bioassays. We also provided a parameter study to demonstrate the effect of α and λ on our proposed methods. Additionally, we demonstrated the efficacy of our proposed dmpna in both compound activity classification and compound prioritization problems, since it performed better than any other compound representation method.

In this work, we paired the bioassays if their corresponding protein targets belong to the same protein family. In other words, when we paired the bioassays, the corresponding pair of tasks was assumed to be related. We assumed that leveraging activity information from related protein targets (i.e., targets belonging to the same protein family) can improve the target task performance. However, we observed that TAc did not improve performance on all the target tasks compared to the best no-transfer baseline method FCN-dmpna. This suggests the occurrence of potential negative transfer. In future work, we will focus on developing a more principled approach to determine task-relatedness. Given a target task, our current method only considers a single source task. This limits the scope of transfer to only one related task and can also impact the performance on the target task if the learning is too focused on the source task. Our future work will incorporate multiple source tasks for each target task by simultaneously learning task-relatedness in a data-driven manner.

Computational Methods

Notations and Definitions

In this section, we list the notations and definitions used in this paper. Table 6 presents a list of notations and their meanings. We represent a compound c using a molecular graph $\mathcal{G}_c$, denoted as $\mathcal{G}_c = (\mathcal{A}_c, \mathcal{E}_c)$, where $\mathcal{A}_c$ is the set of atoms and $\mathcal{E}_c$ is the set of corresponding bonds in c. We denote the set of compounds in a bioassay B as $C_B$ and the activity labels of those compounds accordingly as $Y_B$. In this paper, we use a label “1” or “0” to indicate that a compound is active or inactive in a bioassay, respectively.

Table 6

Notations

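notation	meaning
c/B	compound/bioassay
$\mathcal{G}_c = (\mathcal{A}_c, \mathcal{E}_c)$	molecular graph of c with its set of atoms and bonds
u	an atom in $\mathcal{G}_c$
(u, v)	a bond connecting atoms u and v in $\mathcal{G}_c$
$\mathcal{N}(u)$	neighbors of atom u in $\mathcal{G}_c$
$C_B$	set of compounds in a bioassay B
$Y_B$	set of labels corresponding to $C_B$
$\mathcal{X}$	input feature space
$\mathcal{Y}$	label space
$\mathcal{D}$	a domain consisting of $\mathcal{X}$ and marginal probability distribution $P(X)$
$\mathcal{T} = \{\mathcal{Y}, \omega(\cdot)\}$	a task consisting of label space $\mathcal{Y}$ and a decision function ω(·)
h	hidden state
r	molecular representation out of GNN
z	scaled molecular representation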

We use the following definitions related to transfer learning.

Domain: a domain $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ is a set of labeled compounds, where the compounds {x} are represented in a feature space $\mathcal{X}$, and their activity labels {y} are represented in a label space $\mathcal{Y}$; $n = |\mathcal{D}|$ is the size of the domain (i.e., the number of (x, y) pairs). In our transfer learning, we have two domains: a source domain, denoted as $\mathcal{D}^{(S)}$, and a target domain, denoted as $\mathcal{D}^{(T)}$. In general, these two domains can have different numbers of compounds with different compound feature representations and also different label sets. We use superscripts (S) and (T) to represent information associated with the source domain and the target domain, respectively. For example, $x^{(T)}$ represents a compound from the target domain. In addition, we use X to represent the set of compound features {x}, that is, $X = \{x_i\}_{i=1}^{n}$, and Y to represent the set of compound labels, that is, $Y = \{y_i\}_{i=1}^{n}$. Thus, $\mathcal{D}$ can also be represented as $\mathcal{D} = \{X, Y\}$.

Task: Given a domain $\mathcal{D}$, a task $\mathcal{T}$ is to learn a model that maps each x to its corresponding y. In our transfer learning, we have two tasks: a source task, denoted as $\mathcal{T}^{(S)}$, and a target task, denoted as $\mathcal{T}^{(T)}$. $\mathcal{T}^{(S)}$ and $\mathcal{T}^{(T)}$ learn from the source domain $\mathcal{D}^{(S)}$ and the target domain $\mathcal{D}^{(T)}$, respectively.

Transfer Learning: Transfer learning learns and transfers information from the source task to the target task and helps improve the performance of the target task. The underlying assumptions are that: (1) the target domain does not have sufficient information for the target task to learn a good model, and (2) there are commonalities between the source and the target; such commonalities can be transferred from the source to the target and used to improve the target task.

Methods

In this section, we present our two transfer learning methods: TAc and TAc-fc. We first introduce the overall architecture of TAc and then discuss each component in detail in the subsequent sections (Learning Compound Representations and Learning to Classify Compounds of Each Domain). We discuss the end-to-end optimization process in the section TAc Model Optimization. We then introduce TAc-fc with additional components that learn feature-wise and compound-wise transferability and finally discuss its optimization process in the section TAc-fc Model Optimization.

Overall Architecture of TAc

TAc learns to generate transferable features that can generalize well from one domain to another and increase the predictive power for classification in the target domain. Figure 6 presents the overall architecture of the proposed TAc. TAc consists of two components: (1) a feature learner F that learns to represent chemical compounds and (2) a domain-wise classifier S that classifies chemical compounds of each domain. Below, we discuss each component of TAc in detail.

Figure 6

Proposed architecture of TAc. The feature learner F learns compound representations r given the corresponding molecular graph. The domain-wise classifier S classifies the compound as active/inactive.

Learning Compound Representations

This section describes how the feature representations of compounds are learned. In TAc, the feature representations of chemical compounds are learned in a data-driven fashion. Compared to using static fingerprints or fixed feature representations of molecular structures,[33] such learned features are better adapted to the learning task and enable better performance. We leverage the popular idea of graph neural networks[95] and use the Directed Message Passing Neural Network, denoted as dmpn, developed in Yang et al.[19] Given a molecular graph for a compound c, dmpn learns a feature vector, also called an embedding of c, using graph convolution, by passing messages along directed edges over molecular graphs. In dmpn, two representations for each bond are learned via message passing through the two directions along the bond. Then atom representations are learned from the representations of their incoming bonds. In the end, the compound representation is generated via mean pooling over all the atom representations. Details about dmpn are presented in Section 1 in the Supporting Information and are also available in Yang et al.[19]

Based on dmpn, we further improve the compound representation learning by introducing an attention mechanism inspired by Graph Attention Networks.[96] This new method is referred to as dmpn with attention, denoted as dmpna. Specifically, we replace the mean pooling in dmpn with an attention-based pooling mechanism as follows

$$r = \sum_{u} w_u \odot s_u$$

where the sum runs over all atoms u of c; ⊙ is the element-wise product; $s_u$ is the learned representation of atom u as in dmpn; and $w_u$ is the attention weight on atom u, computed from $s_u$ by $f_a(\cdot)$, a 2-layer feed-forward network with a ReLU activation function after the hidden layer. That is, the attention learns a specific weight on each atom. Thus, the attention mechanism in dmpna can differentially focus on atoms based on the relevance of each atom toward the final predictive task. The network to learn compound embeddings is denoted as F (i.e., F is dmpn or dmpna).
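
The difference between mean pooling and attention-based pooling can be illustrated with a small sketch. It assumes that the attention weights are obtained by a softmax over the scalar scores of a 2-layer feed-forward network applied to each atom representation; the softmax normalization, the scalar output, and all names and toy parameter values are assumptions made here for illustration and are not confirmed by the text.

```python
import math

def feed_forward_score(s, W1, b1, w2, b2):
    """2-layer network with ReLU after the hidden layer and a scalar output."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, s)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def attention_pool(atom_reps, W1, b1, w2, b2):
    """Return (r, weights) with r = sum_u w_u * s_u.
    The weights are softmax-normalized over atoms (assumed)."""
    scores = [feed_forward_score(s, W1, b1, w2, b2) for s in atom_reps]
    m = max(scores)                              # subtract max for stability
    exps = [math.exp(a - m) for a in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(atom_reps[0])
    r = [sum(w * s[i] for w, s in zip(weights, atom_reps)) for i in range(dim)]
    return r, weights

def mean_pool(atom_reps):
    """Plain mean pooling over atom representations, as in dmpn."""
    n = len(atom_reps)
    return [sum(s[i] for s in atom_reps) / n for i in range(len(atom_reps[0]))]

# Toy molecule with three atoms and 2-dimensional atom representations.
atoms = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
w2, b2 = [1.0, 1.0], 0.0
r, weights = attention_pool(atoms, W1, b1, w2, b2)
```

With all network parameters set to zero, every atom receives the same score and the attention pooling reduces to mean pooling, which makes the relation between the two variants explicit.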

Learning to Classify Compounds of Each Domain

This section describes how the compounds of each domain are classified as active/inactive using the learned feature representations. Given the compound embedding r, the domain-wise classifier classifies each compound in a given domain as active or inactive with respect to that domain using a two-layer fully connected neural network S as follows

$$\hat{y} = S(r)$$

with ReLU at the hidden layer and sigmoid at the output layer. The outputs of S are the probabilities of input compounds from the source/target domain being active in the source/target domain. To learn S, the classifier loss $\mathcal{L}_c(\Omega, \Phi)$ combines the classification losses on the source and target domains, where $y^{(S)}$/$y^{(T)}$ is the ground-truth activity label of each compound in the source/target domain; $n^{(S)}$/$n^{(T)}$ is the number of compounds in the source/target domain; α is a hyperparameter to trade off the two classification losses; and Ω and Φ are learnable parameters of F and S, respectively. Please note that both the source domain and the target domain use the same classifier S. Therefore, if the source and target domains have common compounds or very similar compounds, when these compounds have the same labels in the two domains, they will induce small classification errors in both domains; when these compounds have different labels in the two domains, they will induce large errors in one domain and small errors in the other. Minimizing the loss $\mathcal{L}_c(\Omega, \Phi)$ therefore encourages the learning to focus on common or similar compounds that have the same labels in the two domains and prevents the transfer of conflicting information across domains.
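
As a concrete illustration, the sketch below computes a joint classification loss of the kind described above. It assumes a mean binary cross-entropy per domain and places the trade-off weight α on the source-domain term; both choices, as well as the function names, are assumptions made for this example rather than the exact loss of the paper.

```python
import math

def bce(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over one domain."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def classification_loss(y_src, p_src, y_tgt, p_tgt, alpha):
    """Target loss plus alpha-weighted source loss (placement of alpha assumed).
    The same classifier S is assumed to have produced p_src and p_tgt."""
    return bce(y_tgt, p_tgt) + alpha * bce(y_src, p_src)

# Toy predictions for two source and three target compounds.
y_src, p_src = [1, 0], [0.8, 0.3]
y_tgt, p_tgt = [1, 1, 0], [0.9, 0.6, 0.2]
loss = classification_loss(y_src, p_src, y_tgt, p_tgt, alpha=0.5)
```

Under this reading, α = 0 ignores the source domain entirely, and larger α lets the source labels shape the shared parameters more strongly.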

TAc Model Optimization

This section presents the optimization process of the proposed TAc. TAc constructs an end-to-end transfer learning framework with the above two components: (1) feature learner F and (2) domain-wise classifier S. We solve for TAc by minimizing the classification loss $\mathcal{L}_c(\Omega, \Phi)$. In other words, we solve the following optimization problem

$$\min_{\Omega, \Phi} \mathcal{L}_c(\Omega, \Phi)$$

where Ω and Φ are the learnable parameters of F and S, respectively. Minimizing $\mathcal{L}_c$ will minimize the classification error in each domain while preventing transfer of conflicting information across domains, hence enabling the feature learner F to learn better compound features for effective classification in each domain. Since the same F and S are used for both the source and target tasks, minimizing $\mathcal{L}_c$ also enables transfer of relevant information through the shared parameters of F and S. Intuitively, the amount of transferable information from the source domain to the target domain is determined by the degree of task relatedness between those domains. In this work, the degree of task relatedness between the source and target domains is essentially controlled through the hyperparameter α. We will consider learning task relatedness or α in a data-driven manner in future work.

Variant of TAc: TAc-fc

In this section, we propose a variant of TAc where we incorporate additional components to selectively learn feature-wise and compound-wise transferability. We denote this variant as TAc-fc. In addition to the feature learner F and the domain-wise classifier S, TAc-fc consists of two more components: (1) a feature-wise discriminator L that learns the transferability of each learned feature (see Learning Transferability of Individual Features) and (2) a compound-wise discriminator G that separates chemical compounds into source and target domains (see Learning Transferability of Compounds). We refer to TAc with the feature-wise discriminator only as TAc-f and TAc with the compound-wise discriminator only as TAc-c. Below, we discuss each component in detail and also the optimization of the proposed method.

Proposed architecture of TAc-fc. The feature learner F learns compound embedding r given the corresponding molecular graph. The feature-wise discriminator L learns feature-wise transferability given the learned compound embedding r. r is further scaled into z using its feature entropy from p out of L. The compound-wise discriminator G learns the compound-wise transferability given z. The domain-wise classifier S classifies the compound as active/inactive.

Learning Transferability of Individual Features

Given the learned compound embedding r out of F (discussed in Learning Compound Representations), the feature-wise discriminator of TAc-fc learns the transferability of each embedding feature in r using a two-layer neural network L as follows

$$p = L(r)$$

where L has a hidden layer with ReLU and an output layer with sigmoid. Note that $p = [p_1, p_2, \ldots, p_d]$ has the same dimension as r, and $p_i \in [0, 1]$ represents the probability that the i-th embedding feature in r is specific to the source domain. Thus, the feature-wise discriminator determines whether the input compound features (not the input compounds) belong to the source domain or not. For bioactivity prediction problems, if the source and target domains have compounds for protein targets that are from the same protein family, it is very likely that their active compounds are similar and share similar substructures (e.g., pharmacophores). In this case, intuitively, the feature-wise discriminator could learn and represent such similar substructures. We further quantify the transferability of each embedding feature using its entropy as follows

$$H(p_i) = -p_i \log p_i - (1 - p_i) \log (1 - p_i)$$

If $p_i$ is very large or very small and has a low entropy, it indicates that the i-th embedding feature is very likely or very unlikely to be specific to the source domain, and thus it is less likely to be common across domains; if $p_i$ is close to 0.5 and has a high entropy, the feature is less specific to any of the domains and more likely to be common across domains and therefore can be used for information transfer across domains. We then scale compound embedding r into z using feature entropies as follows

$$z = r + H \odot r = (1 + H) \odot r$$

where $H = [H(p_1), H(p_2), \ldots, H(p_d)]$ and ⊙ represents the element-wise product. Each feature is scaled with its entropy and added to itself. Intuitively, the self-addition reduces the loss of informative features due to improper scaling. Thus, in z, domain-invariant embedding features are scaled larger than domain-specific embedding features (H ≥ 0). We will use z as input to the following components.

To learn the feature-wise discriminator, the loss function $\mathcal{L}_f(\Omega, \Theta)$ is defined over the probabilities $p^{(S)}$ and $p^{(T)}$ predicted for all embedding features of the source and target compounds, where $n^{(S)}$/$n^{(T)}$ is the number of compounds in the source/target domain; Ω and Θ are learnable parameters of F (the compound representation learning network) and L (the feature-wise discriminator network), respectively; and d is the dimension of compound feature embeddings. Note that $p^{(S)}$ and $p^{(T)}$ both measure an embedding feature’s probability of being specific to the source domain; superscripts (S)/(T) here indicate that the compounds, whose features are measured, are from the source/target domain, respectively. To have an accurate feature-wise discriminator, embedding features specific to the source/target domain should have large/small probabilities (i.e., large $p^{(S)}$ and small $p^{(T)}$) with respect to the source domain and thus make the $\mathcal{L}_f$ value small. Therefore, minimizing $\mathcal{L}_f$ will encourage accurate probabilities. Meanwhile, the feature learner F should encourage the learning of more transferable embedding features, which will have probabilities close to 0.5 and thus make the $\mathcal{L}_f$ value large. Therefore, maximizing $\mathcal{L}_f$ will encourage more transferable embedding features being learned and learned well. To combine these two aspects, an adversarial optimization is applied to $\mathcal{L}_f$, as described later in TAc-fc Model Optimization.
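
The entropy-based scaling of the embedding can be made concrete with a short sketch. It assumes the binary entropy with the natural logarithm (the base is not stated in the text) and uses hypothetical function names; the input probabilities stand in for the output of the feature-wise discriminator.

```python
import math

def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli(p) variable; natural log assumed."""
    p = min(max(p, eps), 1.0 - eps)              # avoid log(0)
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def entropy_scale(r, p):
    """Scale each embedding feature by its entropy and add it to itself:
    z_i = r_i + H(p_i) * r_i."""
    return [ri * (1.0 + binary_entropy(pi)) for ri, pi in zip(r, p)]

# Three features with identical raw values but different domain specificity.
r = [1.0, 1.0, 1.0]
p = [0.5, 0.99, 0.01]   # ambiguous, source-specific, target-specific
z = entropy_scale(r, p)
```

The ambiguous feature (probability 0.5) is amplified the most, while the two domain-specific features stay close to their original values; because the entropy is non-negative, no feature is ever shrunk.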

Learning Transferability of Compounds

Inspired by the principle that similar compounds tend to bind to similar protein targets, our method identifies such similar compounds that have the same activity labels across two targets and hence learns compound-wise transferability. Given the scaled compound embedding z of compound c, the compound-wise discriminator classifies whether the compound is from the source domain using a two-layer fully connected neural network G as follows

$$q = G(z)$$

with ReLU at the hidden layer and sigmoid at the output layer. If q is very large or very small, c is very likely or very unlikely to belong to the source domain (it is equivalent to calculating the value with respect to the target domain since there are only two domains to consider). If q is close to 0.5, c is likely to be common across domains (e.g., identical or similar compounds in the two domains) and thus can be used for information transfer across domains. To learn the compound-wise discriminator, the loss function $\mathcal{L}_g(\Omega, \Psi)$ is defined over the probabilities $q^{(S)}$ and $q^{(T)}$ predicted for the source and target compounds, where $n^{(S)}$/$n^{(T)}$ is the number of compounds in the source/target domain; Ω and Ψ are learnable parameters of F (the compound representation learning network) and G (the compound-wise discriminator network), respectively; and d is the dimension of the compound feature embeddings. Note that $q^{(S)}$ and $q^{(T)}$ represent the probability of $c^{(S)}$ and $c^{(T)}$ belonging to the source domain. Also, all the compounds from the source and target domains are predicted using the same G. In order to identify similar compounds across domains, the discriminator needs to identify compounds with their q values close to 0.5; when the q values are close to 0.5, $\mathcal{L}_g$ will be maximized. Therefore, maximizing $\mathcal{L}_g$ will encourage more transferable compounds being learned and learned well. Meanwhile, to have an accurate compound-wise discriminator, compounds specific to the source/target domain should have large/small probabilities (i.e., large $q^{(S)}$ and small $q^{(T)}$) with respect to the source domain and thus make the $\mathcal{L}_g$ value small. Therefore, minimizing $\mathcal{L}_g$ will encourage accurate probabilities. To combine these two aspects, as for the feature-wise discriminator loss, an adversarial optimization is applied to $\mathcal{L}_g$, as described later in TAc-fc Model Optimization.

According to G, a compound that is common in the two domains or is similar to compounds in the other domain could be transferable (q value close to 0.5; not specific to the source or target domain). However, such common or similar compounds may have different activity labels in the two domains. Using transferred information from common/similar compounds with conflicting labels in the target task will adversely confuse any learner. The compound-wise discriminator G does not consider activity label information in learning compound transferability and thus possibly introduces conflicting information into the target task. However, in the downstream domain-wise classification, the minimization of domain-specific classification errors will prevent the transfer of conflicting information. Note that the input to the domain-wise classifier S in TAc-fc is z instead of r as in TAc. Given the scaled compound embedding z, the domain-wise classifier classifies each compound in a given domain as active or inactive with respect to that domain using a two-layer fully connected neural network S as follows

$$\hat{y} = S(z)$$

with ReLU at the hidden layer and sigmoid at the output layer. As discussed for TAc, minimizing the classification loss $\mathcal{L}_c(\Omega, \Phi)$ enables correct classification in each domain and prevents the transfer of conflicting information across domains.
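
The two discriminator losses can be illustrated by a domain-label cross-entropy. The forms below are a sketch under assumptions, not the equations of the paper: the names $\mathcal{L}_f$ (feature-wise) and $\mathcal{L}_g$ (compound-wise), the indices i (feature) and j (compound), and the normalization by d and by the domain sizes are chosen here for illustration.

```latex
% Assumed feature-wise discriminator loss: source compounds carry domain
% label 1, target compounds carry domain label 0, averaged over d features.
\mathcal{L}_f(\Omega, \Theta) =
  -\frac{1}{d\, n^{(S)}} \sum_{j=1}^{n^{(S)}} \sum_{i=1}^{d} \log p^{(S)}_{j,i}
  -\frac{1}{d\, n^{(T)}} \sum_{j=1}^{n^{(T)}} \sum_{i=1}^{d} \log\bigl(1 - p^{(T)}_{j,i}\bigr)

% Assumed compound-wise discriminator loss: same idea with one
% probability q per compound.
\mathcal{L}_g(\Omega, \Psi) =
  -\frac{1}{n^{(S)}} \sum_{j=1}^{n^{(S)}} \log q^{(S)}_{j}
  -\frac{1}{n^{(T)}} \sum_{j=1}^{n^{(T)}} \log\bigl(1 - q^{(T)}_{j}\bigr)
```

Both forms behave as the text describes: large source probabilities together with small target probabilities drive the loss toward zero, whereas probabilities near 0.5 give a value of about 2 log 2, which is larger than that of an accurate discriminator.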

TAc-fc Model Optimization

This section presents the optimization process of the proposed TAc-fc. TAc-fc constructs an end-to-end adversarial transfer learning framework with the above four components: (1) compound feature representation learning network F, (2) feature-wise discriminator L, (3) compound-wise discriminator G, and (4) domain-wise classifier S. We solve for TAc-fc through optimizing the following loss function

$$\mathcal{L}(\Omega, \Theta, \Psi, \Phi) = \mathcal{L}_c(\Omega, \Phi) - \lambda \left( \mathcal{L}_f(\Omega, \Theta) + \mathcal{L}_g(\Omega, \Psi) \right)$$

where $\mathcal{L}_c$, $\mathcal{L}_f$, and $\mathcal{L}_g$ are the loss functions of S, L, and G, respectively; Ω, Θ, Ψ, and Φ are learnable parameters of F, L, G, and S, respectively; and λ is a trade-off parameter. This loss function combines the three loss functions for L, G, and S and is optimized in an adversarial way as follows.

(Step 1). Minimize $\mathcal{L}$ with respect to Ω and Φ (with Θ and Ψ fixed) via solving the following optimization problem:

$$\min_{\Omega, \Phi} \mathcal{L}(\Omega, \Theta, \Psi, \Phi)$$

By minimizing $\mathcal{L}$, we essentially minimize $\mathcal{L}_c$ and maximize $\mathcal{L}_f$ and $\mathcal{L}_g$. As discussed for L and G, maximizing $\mathcal{L}_f$ and $\mathcal{L}_g$ will encourage learning of transferable features and compounds that can be used to help the tasks; as discussed for S, minimizing $\mathcal{L}_c$ will prevent the transfer of conflicting information, in addition to minimizing the classification errors in each task.

(Step 2). Maximize $\mathcal{L}$ with respect to Θ and Ψ (with Ω and Φ fixed) via solving the following optimization problem:

$$\max_{\Theta, \Psi} \mathcal{L}(\Omega, \Theta, \Psi, \Phi)$$

By maximizing $\mathcal{L}$, we essentially minimize $\mathcal{L}_f$ and $\mathcal{L}_g$ ($\mathcal{L}_c$ is fixed in this step). As discussed for L and G, minimizing $\mathcal{L}_f$ and $\mathcal{L}_g$ will encourage L and G to accurately learn features and compounds that are specific to each domain to improve the classification performance of each domain.

(Step 3). The above two steps are iterated until the learning converges.

Thus, the optimization problem consists of a maximization with respect to some parameters and a minimization with respect to the others. In order to tackle such a mini-max optimization, we insert the gradient reversal layer (GRL)[64] between F and the discriminators L and G. GRL reverses the gradients during the backward propagation and hence optimizes parameters Ω by maximizing the discriminator loss.
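
The effect of a gradient reversal layer can be shown with a toy example that uses manual gradients instead of an autograd framework. Everything in it is hypothetical and deliberately minimal: the feature learner and the discriminator each have a single weight (omega and theta), the discriminator loss is the negative log-probability of one source-domain sample, and the reversal strength lam is set to 1.

```python
import math

class GradientReversal:
    """Identity in the forward pass; multiplies the incoming gradient
    by -lam in the backward pass."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.lam * grad_output

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def domain_loss(omega, theta, x):
    """-log q for one source-domain sample, q = sigmoid(theta * omega * x)."""
    return -math.log(sigmoid(theta * omega * x))

def gradients(omega, theta, x, grl):
    """Manual backward pass. Returns (gradient for omega, gradient for theta)."""
    f = omega * x                      # toy feature learner
    z = grl.forward(f)                 # GRL sits between F and the discriminator
    q = sigmoid(theta * z)             # toy discriminator
    dloss_dlogit = q - 1.0             # d(-log sigmoid(a)) / da
    grad_theta = dloss_dlogit * z
    grad_z = dloss_dlogit * theta
    grad_f = grl.backward(grad_z)      # sign is flipped here
    grad_omega = grad_f * x
    return grad_omega, grad_theta

omega, theta, x, lr = 0.5, 0.8, 1.0, 0.1
grl = GradientReversal(lam=1.0)
g_omega, g_theta = gradients(omega, theta, x, grl)

loss_before = domain_loss(omega, theta, x)
loss_after_theta_step = domain_loss(omega, theta - lr * g_theta, x)
loss_after_omega_step = domain_loss(omega - lr * g_omega, theta, x)
```

A single gradient-descent step on both parameters therefore lowers the discriminator loss through theta and raises it through omega, which is the mini-max behavior that the alternating steps describe. In an autograd framework the same effect is typically obtained with a custom backward function.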
