Literature DB >> 31518132

Learning Drug Functions from Chemical Structures with Convolutional Neural Networks and Random Forests.

Jesse G Meyer, Shengchao Liu, Ian J Miller, Joshua J Coon, Anthony Gitter.

Abstract

Empirical testing of chemicals for drug efficacy costs many billions of dollars every year. The ability to predict the action of molecules in silico would greatly increase the speed and decrease the cost of prioritizing drug leads. Here, we asked whether drug function, defined as MeSH "therapeutic use" classes, can be predicted from only a chemical structure. We evaluated two chemical-structure-derived drug classification methods, chemical images with convolutional neural networks and molecular fingerprints with random forests, both of which outperformed previous predictions that used drug-induced transcriptomic changes as chemical representations. This suggests that the structure of a chemical contains at least as much information about its therapeutic use as the transcriptional cellular response to that chemical. Furthermore, because training data based on chemical structure is not limited to a small set of molecules for which transcriptomic measurements are available, our strategy can leverage more training data to significantly improve predictive accuracy to 83-88%. Finally, we explore use of these models for prediction of side effects and drug-repurposing opportunities and demonstrate the effectiveness of this modeling strategy for multilabel classification.

Entities: Chemical Disease Gene Species

Year: 2019 PMID： 31518132 PMCID： PMC6819987 DOI： 10.1021/acs.jcim.9b00236

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

Development of molecules with new or improved properties is needed in many industries, including energy, agriculture, and medicine. However, the number of possible molecules to explore, also referred to as chemical space, is exceedingly large.[1,2] Even when chemical space is limited to compounds that conform to “Lipinski’s rule of five”,[3] which applies to the subtask of drug development, there are still as many as 1060 possible chemical structures.[4] Regardless of the available chemical diversity, the pace of new drug approvals has steadily decreased, leaving room for new approaches that can improve the current process. A promising approach for discovering new drug molecules is machine learning,[5,6] which includes so-called deep learning using deep neural networks (DNNs).[7] Many studies describe methods for embedding molecules into a latent space and engineering molecules with desirable properties.[8−10] Reinforcement learning has been applied with paired DNNs to design molecules with desired properties, such as solubility or transcription factor inhibition.[11] A framework for benchmarking model predictions is available.[12] One study used a generative adversarial neural network architecture to generate molecules that should induce specific transcriptomic states.[13] There are many ways to represent molecules for machine learning. Many papers use SMILES strings[14−16] as molecular inputs for embedding, but there is a trend toward the use of molecular graphs.[17−19] A general weakness of DNNs is that they perform best with large amounts of training data (100 000 to millions of examples, e.g., ImageNet[20]). However, DNNs can be used for problems with small training data sets through transfer learning, where networks are trained on a large data set for one problem and adapted for a related problem that has less training data.[21−23] For example, transfer learning has been applied to classification of fewer than 6000 medical ultrasound images,[24] only 2000 oceanfront images,[25] or fewer than 1000 cellular images,[26] even though these networks were pretrained on images of completely different objects. One type of DNN for structured data such as sequences and images, the convolutional neural network (CNN), has enabled major advances in image-processing tasks in diverse fields. In chemistry, several papers have described excellent performance resulting from the use of two-dimensional images of chemicals with CNNs. This approach has been used effectively to predict chemical toxicity[27] with regard to the 12 biological toxicity endpoints in the Tox21 challenge.[28] CNNs for chemical images have also been described as a general-purpose molecule property prediction tool despite their lack of explicit chemistry knowledge.[29] These authors found that augmenting the same deep-learning architecture with only three additional chemical properties further improved model performance,[30] suggesting that chemical images alone may not entirely capture the important characteristics of a chemical. Finally, images and CNNs have been used to predict drug–protein interactions and outperformed models trained on flattened versions of the images, which cannot exploit the spatial structure.[31] The various ways to measure and represent molecules lead to a philosophical question about the nature of chemicals,[32] and the related fundamental question of whether a single representation can completely describe a chemical entity (reviewed in ref (9)). As described above, several chemical-structure-derived embeddings are often used for cheminformatics, such as chemical images or circular molecular fingerprints. Alternatively, molecules can be represented by an analytical measurement[33] or by their influence on biological systems (e.g., transcriptomic or morphological changes).[34,35] There are open questions regarding the relative utility of these various chemical representation strategies for different predictive tasks relevant to drug development. In this paper, we use machine learning and chemical-structure-derived molecule representations to predict specific medical subheading (MeSH) “therapeutic uses” classes.[36] We first performed the same classification task with the same set of 676 molecules previously selected by Aliper et al.[34] In contrast to our chemical-structure-derived models, Aliper et al. used molecule-induced transcriptome changes from the LINCS project[37] as a proxy molecule representation. We employed two strategies: (1) chemical images with CNNs and (2) Morgan fingerprints (MFPs)[38] with random forests (RFs)[39] (Figure ). We chose to use the CNN with images because of extensive precedent for the effectiveness of this pair, and we chose to use MFP with RF because we and others have seen excellent performance of this representation-model pair.[40] Our goal was to assess whether drug function classifier models trained with readily available chemical structure inputs can outperform models trained with empirical measures of drug effects. Our results support the effectiveness of chemical-structure-based models. Both classification models trained with chemical-structure-derived features greatly outperform the previous benchmark based on drug-induced transcriptomic changes. Furthermore, because we require only a chemical structure, the models can be greatly improved by training on more than 6000 additional compounds that do not have associated transcriptomic data. Our main contribution is that chemical structures alone are effective predictors of therapeutic use classes.

Figure 1

Structure-based drug classification pipelines. Chemicals from 12 medical subheadings (MeSH) therapeutic use classifications were converted to either two-dimensional color molecule images or Morgan molecular fingerprints. Molecule images were used to train a convolutional neural network (IMG + CNN) classifier, and fingerprints were used to train a random forest (MFP + RF) classifier. The models were used separately to predict classes of drugs using stratified cross-validation.

Methods

All Python code for the CNN and RF models and the preprocessed data sets are available from https://github.com/jgmeyerucsd/drug-class. The release version 0.1 of the repository including all Python code and data has also been archived at Zenodo (https://doi.org/10.5281/zenodo.3385194).

Data

A primary goal of this work was to compare empirically derived chemical features, such as the transcriptome-based model from Aliper et al.,[34] with chemical-structure-derived representations. Therefore, in the first evaluation, we emulated their framing of the prediction task, which is to predict 1 of 12 MeSH therapeutic use classes of chemicals. The specific version of the data used by Aliper et al., including training/validation groups, is unavailable. Therefore, our exact training and validation sets are different. To make the fairest-possible comparison, we constructed a data set following the same guidelines as Aliper et al. Molecules were selected from PubChem[41] on October 2nd, 2018 according to their MeSH therapeutic uses classification (chemicals and drugs category > chemical actions and uses > pharmacologic actions > therapeutic uses). Although there are 20 high-level categories, we used only the 12 classes described previously.[34] Molecules in these 12 classes were downloaded in a spreadsheet containing their compound identification numbers (CIDs). A total of 11 929 CIDs were converted to SMILES strings using the Python package pubchempy (https://github.com/mcs07/PubChemPy). SMILES strings with a length of more than 400 or membership to more than 1 of 12 MeSH therapeutic classes were excluded, leaving 8372 SMILES. For this analysis, chemicals in multiple classes were excluded as described previously to enable direct comparison.[34] This final list was filtered to remove multiple versions of molecules that differ by only accompanying salts. The final filtered total was 6955 molecules. The distribution of molecules among classes is given in Table . This set of all molecules was divided into five folds stratified based on the class for cross-validation.

Table 1

Summary of Examples across Data Classes and Task Subgroups

MeSH therapeutic uses	# Aliper et al.	# total	task subgroups
antineoplastic	111	1177	3, 5, 12
cardiovascular	125	788	3, 5, 12
central nervous system (CNS)	172	1139	3, 5, 12
anti-infective	141	2398	5, 12
gastrointestinal	30	258	5, 12
anti-inflammatory	19	373	12
dermatological	6	116	12
hematologic	17	267	12
lipid regulating	19	164	12
reproductive control	16	148	12
respiratory system	11	101	12
urological	9	26	12

SMILES strings were converted into three-color (RGB) images with size 500 × 500 pixels or 1024 bit Morgan fingerprints using the Python package RDKit.[42] Images generated by RDKit always fit the entire molecule structure, so molecules of different sizes are not problematic. All images used for training and validation are available on GitHub. Molecule classes were then split into three subgroups for model training and prediction: 3-, 5-, or 12-class prediction tasks according to the groupings described previously by Aliper et al. (Table ). For the comparison with Aliper et al.’s results, we took the list of molecules in their Table S1 (Supporting Information) and retrieved SMILES strings from PubChem using pubchempy. Images were generated as described above. During the removal of salts from their original set of 678 molecules, we found that two drugs, one from the anti-infective class and one from the CNS class, were the same molecules with different salt pairs. The copies of these two duplicate molecules were removed. The numbers of chemicals in this smaller data set are given in Table . This set of 676 molecules was split into 10 folds stratified based on class membership to mimic the methods or Aliper et al. as closely as possible. However, the dermatological and urological classes have less than 10 molecules and therefore are missing validation examples in some folds. For those folds missing validation examples, the receiver operator characteristic area under the curve (ROC AUC) and average precision metrics were not computed.

Images with Convolutional Neural Networks (IMG + CNN): Single Label

Molecule images with RGB channels were resized to 150 × 150 pixels and used for retraining and validation of a CNN with predetermined weights from resnext101_64[43] implemented using fastai and pytorch.[44] The loss function used was binary cross entropy, and the output layer was logsoftmax. A cyclic cosine annealing learning rate was used during training,[45] which decreases from the initial setting toward 0 over a number of epochs. The number of epochs needed to decay the learning rate to the final value was doubled every cycle. An example of the learning rate versus batch is shown in Figure S1 along with the corresponding training loss. To determine the best hyperparameters for all CNN models and data subsets, we first performed hyperparameter optimization on the small set of 676 compounds. Hyperparameter optimization was done with nested 10-fold cross-validation using class-stratified folds. Varied hyperparameters were as follows: (1) dropout proportions of 20, 40, or 60%, (2) retraining all weights or only the output layer weights, and (3) the initial learning rates for cosine annealing ([5 × 10–5, 5 × 10–4, and 5 × 10–3] or [1 × 10–4, 1 × 10–3, and 1 × 10–2] for early, middle, and output layers, respectively). Fixed training hyperparameters were the batch size of 25, seven cycles of cosine annealing learning rate with decay rate decreased by half each cycle (totaling 127 epochs), and data augmentation with random zooms of up to 10% and random horizontal or vertical image flips. Average accuracy values from each of the hyperparameter groups tested during the inner loops of nested cross-validation are given in Table S1. Based on the results of this hyperparameter search, the hyperparameters that most often resulted in the best accuracy on the inner loop fold were used for training all other models, including CNNs trained on the larger set of 6955 compounds. The tested learning rates had a minimal effect on accuracy. The largest effect on accuracy resulted from retraining all weights instead of training only the output weights. These best hyperparameters from the grid were (1) 40% dropout, (2) retraining all weights, and (3) the higher learning rate set of [1 × 10–4, 1 × 10–3, and 1 × 10–2] for early, middle, and output neuron layer groups, respectively.

Molecular Fingerprints with Random Forests (MFP + RF)

Random forests[39] are ensembles of decision trees, where each tree is learned on a subsample of data points and features (in this case, bits in a molecular fingerprint). Benchmarking studies often include MFP + RF models because they are easy to train and have strong performance on a variety of computational chemistry tasks.[12,40,46−50] The random forest model was implemented with scikit-learn.[51] Separate hyperparameter grid searches (216 combinations, Table ) were performed for the 676 and 6955 compound analyses in a nested cross-validation setting. For each outer loop, the best set of hyperparameters was selected based on the inner loop cross-validation accuracy. These hyperparameters were then used to train on all of the inner loop compounds and assess performance on the outer loop validation set.

Table 2

Hyperparameter Values for the Random Forest Classifier That Were Explored by Grid Search

parameter	values
# estimators	50, 250, 1000, 4000, 8000, 16 000
max features	none, sqrt, Log 2
min sample leaf	1, 10, 100, 1000
class weight	none, balanced subsample, balanced

Comparison of Druglike Properties

CIDs were used to download molecular weight, XLog P, HBondAcceptorCount, HBondDonorCount, and IsomericSMILES values using pubchempy. Compounds were then filtered to include nonredundant IsomericSMILES values and only drugs with a single-class label (Table S2). XLog P values are computed[52] rather than measured and were not available for all queried compounds (Table S3). Violin plots were created using ggplot2 (https://ggplot2.tidyverse.org/). For each quantitative feature, a Welch’s analysis of variance (ANOVA) and Games-Howell posthoc test (R package userfriendlyscience, https://cran.r-project.org/web/packages/userfriendlyscience/index.html) were used to compare differences between chemical features between drug class groups. This test was selected because it does not assume a normal distribution, even variance, or equal sample sizes between groups.[53] Adjusted p-values from the Games-Howell posthoc test are reported in Table S4. Drug-class-level distribution and pairwise relations of chemical features were further visualized with Seaborn (https://seaborn.pydata.org/).

Single-Label Classification Models: Training, Comparison, and Evaluation

Training data for 676 molecules with transcriptomic measurements available was split into 10 folds for cross-validation, and the training data for all 6955 available annotated molecules in the 12 MeSH classes was split into 5-folds. When referring to model performance and metrics, all values are from the held-out folds referred to as validation folds, and metrics are the average performance of the validation folds unless otherwise specified. Although it is often recommended to use chemical-structure-based clustering or more advanced techniques[54] to create cross-validation folds, we randomly split molecules into folds. We checked the chemical similarity of our five folds of the larger data set using ChemTreeMap[55] and found the folds to be randomly distributed in chemical space (Figure S2). Therefore, the validation fold molecules are within the applicability domain[56,57] of the trained models. Consequently, the evaluation could overestimate the performance of models on new molecules outside this applicability domain. Trained models were evaluated using the following metrics from scikit-learn 0.20.3: accuracy, balanced accuracy, Matthew’s correlation coefficient (MCC), ROC AUC, and average precision score. The classification accuracy of the five validation sets was used as the primary model comparison metric. Class-specific prediction accuracy of the models was compared using confusion matrices for a single representative validation fold. We also compared our prediction accuracy with the accuracy previously reported by Aliper et al.[34] However, it should be clearly noted that although we used the same molecules, the exact training and validation sets were unavailable, so the accuracies are not perfectly comparable.

IMG + CNN: Multilabel Classification

The set of all molecules including those with multiple class memberships (8336 molecules assigned a total of 9885 classes, an average of 1.2 classes per molecule) was used to train additional convolutional neural networks for multilabel classification using the fastai package. The data was split into five folds based on pairwise class co-occurrence using the iterative class splitter from the skmultilearn package.[58] A Jupyter notebook containing the code used to train the models is available on the GitHub repository under multiclass_data/multiclass_5foldCV.ipynb. Resnet50 was used as the pretrained model and weights, and 40% dropout was used with image data augmentation. Training images were 256 × 256 and were processed in batches of 40. All weights were retrained for each CNN model for 127 epochs (the same number as for a single class) using the updated one-cycle policy.[59] The multilabel classification was evaluated by computing thresholded accuracy and Fβ (β of 2.0, the fastai default) using a default score cutoff of 0.5. ROC AUC and average precision scores were also computed as described for the single-label classification models using the weighted average. Finally, a network of the class relationships was computed using the pairwise co-occurrence of classes using the networkx (https://networkx.github.io/) and igraph (https://igraph.org/python/) Python packages according to the skmultilearn tutorial (http://scikit.ml/labelrelations.html). Network graphs were visualized with their edge width proportional to the strength of the node relationship as defined by the number of co-occurrences of the classes.

Results

Classification with a Small Benchmark Data Set

Two chemical-structure-derived representations were used for training and classification with two different model architectures: (1) IMG + CNN or (2) MFP + RF (Figure ). Molecules were split into three subtask sets as described previously.[34] Each subtask set contained 408, 579, or 676 molecules for the 3-, 5-, and 12-class problems, respectively. Table gives a summary of the validation set accuracy for the models described here in comparison with results from Aliper et al. who used a multilayer perceptron DNN or support vector machine (SVM) with gene expression changes as the model input. For 5- and 12-class subtasks, MFP + RF performed best, achieving 64.1% accuracy on the 12-class prediction task, representing an improvement over the expression-based DNN that achieved only 54.6% accuracy. The IMG + CNN model produced accuracy similar to the MFP + RF for the 3-class subtask but achieved about 4% points worse accuracy on the 5- and 12-class problems. However, IMG + CNN models still significantly outperformed the previous gene-expression-based models in all cases. Accuracy can be inflated when the class labels are not evenly distributed (Table ). Balanced accuracy and MCC are more robust with skewed classes but were not reported for the gene-expression-based models. For the IMG + CNN and MFP + RF, these values are lower than the accuracy but good overall.

Table 3

Average Metrics for Each of 10 Hold-Out Folds from Cross-Validation Using 676 Molecules from Aliper et al. Annotated with Only One of the 12 MeSH Classesa

problem group	metric	SVM¹	DNN²	IMG + CNN³	MFP + RF⁴
3-class	accuracy	0.53	0.701	0.747 ± 0.0657	0.742 ± 0.0692
	balanced accuracy			0.739 ± 0.0644	0.715 ± 0.0766
	MCC			0.619 ± 0.102	0.612 ± 0.106
	ROC AUC			0.870 ± 0.0412	0.894 ± 0.0417
	ave. precision score			0.806 ± 0.0592	0.847 ± 0.0588
5-class	accuracy	0.417	0.596	0.653 ± 0.0451	0.694 ± 0.0497
	balanced accuracy			0.620 ± 0.0509	0.635 ± 0.0661
	MCC			0.549 ± 0.0599	0.606 ± 0.0660
	ROC AUC			0.867 ± 0.0322	0.892 ± 0.0284
	ave. precision score			0.735 ± 0.0568	0.791 ± 0.0471
12-class	accuracy	0.366	0.546	0.608 ± 0.0500	0.641 ± 0.0331
	balanced accuracy			0.507 ± 0.107	0.504 ± 0.0522
	MCC			0.525 ± 0.0620	0.572 ± 0.0388
	ROC AUCb			0.863 ± 0.209	0.896 ± 0.0200
	ave. precision scoreb			0.672 ± 0.0303	0.751 ± 0.0205

Receiver operator characteristic area under the curve (ROC AUC) and average precision score were computed as the weighted average of scores across classes and only computed for the first six validation sets of the 12-class problem due to fewer than 10 examples in the dermatological and urological classes.

Values for the gene-expression-based models are from Aliper et al. who used different training and validation folds for 10-fold cross-validation with a 1support vector machine (SVM) or 2multilayer perceptron deep neural network (DNN) based on pathway activation scores. Values from this paper using 3molecule images input to a convolutional neural network (IMG + CNN) or 4Morgan molecular fingerprints as the input to the random forest (MFP + RF). Values for 3,4 are the mean of the validation folds ± standard deviation. Receiver operator characteristic area under the curve (ROC AUC) and average precision score were computed as the weighted average of scores across classes and only computed for the first six validation sets of the 12-class problem due to fewer than 10 examples in the dermatological and urological classes.

Classification with All Annotated Chemicals

A major limitation of using empirical chemical features generated from biological experiments, such as gene expression, is that time and financial costs of experiments limit the size of the training data. For the drug function prediction task, there are roughly 10× more molecules available annotated with the 12 MeSH classes than the number of molecules with transcriptomic data. To highlight the value of using a chemical-based representation instead of an empirical representation, we trained additional models with all available 6955 chemical structures. The use of more training data was greatly beneficial to both representation-model pairs resulting in accuracies of 83–88% and ROC AUC values of more than 0.969 (Table ). ROC curves for predictions from the IMG + CNN model are shown in Figure , and curves for the MFP + RF are shown in Figure S3. With this larger training data set, the five-fold cross-validation evaluation metrics are quite similar for the IMG + CNN and MFP + RF models. The MFP + RF model has a slight advantage over the IMG + CNN model when using the metrics that consider the complete rankings of chemicals by predicted class probabilities (ROC AUC and average precision).

Table 4

Average Metrics for Each of Five Validation Folds from Cross-Validation Using the Full Set of 6955 Molecules Annotated with Only One of the 12 MeSH Classes

problem group	metric	IMG + CNN	MFP + RF
3-class	accuracy	0.884 ± 0.0108	0.882 ± 0.0142
	balanced accuracy	0.879 ± 0.0143	0.870 ± 0.0162
	MCC	0.823 ± 0.0168	0.822 ± 0.0217
	ROC AUC	0.970 ± 0.0063	0.978 ± 0.00382
	ave. precision score	0.950 ± 0.0108	0.978 ± 0.00382
5-class	accuracy	0.863 ± 0.0104	0.871 ± 0.00700
	balanced accuracy	0.828 ± 0.0167	0.822 ± 0.0183
	MCC	0.811 ± 0.0140	0.821 ± 0.00969
	ROC AUC	0.972 ± 0.0046	0.981 ± 0.00284
	ave. precision score	0.933 ± 0.0093	0.950 ± 0.00582
12-class	accuracy	0.834 ± 0.0084	0.838 ± 0.00677
	balanced accuracy	0.735 ± 0.0258	0.719 ± 0.0248
	MCC	0.793 ± 0.0105	0.797 ± 0.00831
	ROC AUCa	0.969 ± 0.0026	0.977 ± 0.00227
	ave. precision scorea	0.900 ± 0.0073	0.918 ± 0.00392

Receiver operator characteristic area under the curve (ROC AUC) and average precision score were computed as the weighted average of scores across classes.

Figure 2

Receiver operator characteristic curves from the IMG + CNN model predictions on the fifth validation set of the (A) 3-, (B) 5-, and (C) 12-class data sets. Performance on this example fold is representative of the performance on all five folds shown in Table . Receiver operator characteristic area under the curve (ROC AUC) and average precision score were computed as the weighted average of scores across classes.

Model and Representation Comparisons

Given the unequal stratification among examples within classes in the training and validation sets, the per-class performance of both models was compared on one representative validation fold. Confusion matrices enable this comparison by showing the fraction of predicted classes for each true class. Confusion matrices from IMG + CNN model predictions for each subtask revealed differences in class-wise prediction accuracy (Figure ). In the 3-class subtask, the prediction performance was similar among the three groups. The most difficult class to predict was cardiovascular drugs; 13% of cardiovascular drugs were predicted incorrectly as CNS drugs (Figure A). In the 5-class subtask, which includes the 3-class drugs and gastrointestinal and anti-infective drugs, classification accuracy was generally worse than the 3-class subtask (Figure B). The proportion of correct predictions roughly followed the number of examples available with anti-infective drugs predicted at high correctness (93%). Gastrointestinal drugs are the smallest class but are predicted more correctly (78%) than cardiovascular drugs (76%). In the 12-class prediction task, the correctness for anti-infective molecules remained the highest (92%), and smaller classes were generally predicted less accurately (Figure C). The smallest class, “urological agent”, which contains only 26 molecules, was rarely predicted correctly in the validation set (40%). The difference in class sizes likely contributes to this deficiency, and we did not directly control for this during training. Cardiovascular drugs were consistently confused for CNS drugs across all three subtasks.

Figure 3

Confusion matrices from IMG + CNN classifiers showing the proportion of each predicted class (x axis) for molecules in each true class (y axis). Results are from the (A) 3-, (B) 5-, and (C) 12-class task subgroups. Each matrix shows the predictions from the fifth validation set using IMG + CNN models trained on the large single-class data set. The same confusion matrix analysis of per-class validation set predictions for results from the MFP + RF models showed overall similar percentages of correct predictions within 5% points (Figure S4). An interesting observation is that the MFP + RF made different general errors than the IMG + CNN model, suggesting that these models learn different features. MFP + RF more often overpredicted drugs as anti-infective, which may explain the 97% correct predictions for that class. For example, the RF and CNN predicted 11 and 0%, respectively, of hematological drugs as anti-infective. There were also class-specific performance differences. MFP + RF models were better at predicting CNS agents correctly (MFP + RF: 90% vs IMG + CNN: 85%) and dermatologic agents (MFP + RF: 57% vs IMG + CNN: 52%), but the IMG + CNN models were better at predicting cardiovascular agents (MFP + RF: 68% vs IMG + CNN: 73%). Overall, these class-wise analyses of model predictions with confusion matrixes revealed interesting differences in model performance.

Chemical Insight into Learned Molecule Properties

Druglike properties of a compound are related to its chemical features such as molecular weight, lipophilicity, and the number of hydrogen bond donors and acceptors (e.g., quantitative structure activity relationships). Lipinski’s rule of five famously states that a druglike compound should have no more than five hydrogen bond donors, 10 hydrogen bond acceptors, a log P greater than 5 (related to hydrophobicity), and a molecular weight under 500.[3] These properties are directly or indirectly encoded in the image- and fingerprint-based representations of chemical structure we used to train models. To make inferences about what chemical properties our models may have learned, we computed molecular weight, Xlog P, and hydrogen bond donors and acceptors for our all single-class molecules and compared their distributions with one-way ANOVA and Games-Howell posthoc testing (Figures and S5; Tables S2 and S4). Our IMG + CNN model often confused respiratory drugs with cardiovascular drugs (30%, Figure ), and the chemical property analysis revealed that this drug class was indistinguishable from cardiovascular drugs with regard to the four computed properties (Table S4, row 51, adjusted p-value = 1). The similarity of these properties may explain the confusion. Conversely, respiratory drugs are indistinguishable from gastrointestinal drugs across Lipinski’s properties, but both models can easily distinguish these two classes. This suggests that there are important structural features of drugs learned by the classifiers that fall beyond the conventional framework of how chemists understand druglike chemical properties.

Figure 4

Class-level distribution of Lipinski’s druglike properties: (A) molecular weight, (B) Xlog P, (C) hydrogen bond acceptor count, and (D) hydrogen bond donor count.

Class-level distribution of Lipinski’s druglike properties: (A) molecular weight, (B) Xlog P, (C) hydrogen bond acceptor count, and (D) hydrogen bond donor count. Other connections between statistically significant differences in molecular properties and model confusion are less interpretable. Dermatologic drugs were often confused for anti-infective (17%) and gastrointestinal drugs (9%) despite all properties showing significant differences (adjusted p-values < 3.6 × 10–3). However, the MFP + RF model often confused dermatologic drugs for CNS agents (Figure S4), which is a mistake that the IMG + CNN model never makes (0%). This pair of chemical classes is statistically different in only the number of hydrogen bond donors. Thus, it is possible that the RF is underweighting this difference, whereas the IMG + CNN model has learned to use it.

Drug-Repurposing Opportunities and Mechanism of Action from Misclassification

Misclassification of drugs can be interpreted in at least two ways: (1) the model has not learned enough to accurately predict the true class, or (2) the model has learned something new about the drugs and classes. Although the latter is more interesting, the former is the safer and more likely interpretation. However, cases where the model is wrong might present opportunities for drug repurposing. In addition, we hypothesize that those incorrect predictions might be useful for understanding drug mechanisms. For example, among the six molecules in the urological drug validation set, the IMG + CNN model misclassified trospium as a central nervous system (CNS) agent. This is not surprising, however, because trospium is known mechanistically as a muscarinic antagonist,[60] which is a common function of CNS drugs. In fact, the structure of trospium is similar to another muscarinic antagonist used to treat Parkinson’s disease, benztropine[61] (Figure ).

Figure 5

Example of a misclassified drug that reveals the mechanism and repurposing opportunities. (A) Structure of trospium, a urological drug known to act as a cholinergic muscarinic antagonist, which was classified by the model as a CNS agent. (B) Structure of benztropine, a muscarinic antagonist used to treat Parkinson’s disease.

Multilabel Classification

The same strategy of structure-based drug molecule function prediction was also extended to multilabel classification, where more than one MeSH therapeutic uses class can be predicted for each chemical structure. When not filtered for molecules present only in a single class, there are a total of 8336 molecules assigned to these 12 MeSH therapeutic uses classes; each molecule has an average of 1.2 class labels. We trained a separate IMG + CNN model to learn these drug classes in the multilabel setting and evaluated the accuracy, Fβ score, ROC AUC, and average precision score (Table ). This multilabel prediction model achieved excellent performance in all computed metrics, but the ROC AUC and average precision scores show that the multilabel prediction task is more challenging than the single-label version (Table ). This demonstrates the feasibility of extending our structure-based therapeutic use prediction models for multifunction prediction.

Table 5

Multilabel Classification of the 8336 Molecules with an Average of 1.2 Classes Assigned to Each Moleculea

problem group	metric	IMG + CNN
12-class multilabel	thresh. accuracyb	0.954 ± 0.00133
	F_βb	0.635 ± 0.0168
	ROC AUC	0.938 ± 0.00353
	ave. prec. score	0.837 ± 0.00953

Class score thresholds set to 0.5.

Results are from fivefold cross-validation with folds determined by iterative stratification. Accuracy, ROC AUC, and average precision scores are not directly comparable to the 12-class single-class formulation (Table ) because the number of molecules differs. Class score thresholds set to 0.5. Multilabel predictions were evaluated locally by examining specific examples. Examples of multiclass molecules that were predicted perfectly are shown in Figure . Each of these molecules was correctly classified as anti-infective despite very diverse chemical constituents. Hypocrellin is a round, nearly planar compound with a significant aromatic electron system and several types of oxygen functional groups, including methyl ethers, hydroxyls, and ketones (Figure A). Sulfasalazine is often used as a first-line rheumatoid arthritis treatment and is a long compound with multiple unique functional groups, including azobenzene (Figure B). Acemannan is a mucopolysaccharide from aloe vera leaves (Figure C). Hypocrellin and acemannan were also correctly labeled as antineoplastic drugs, and both sulfasalazine and acemannan were recognized as gastrointestinal drugs. These examples show that the model has learned to associate diverse chemistries with drug functions.

Figure 6

Examples of correctly predicted multiclass molecules. (A) Hypocrellin is classified as anti-infective, antineoplastic, and dermatologic. (B) Sulfasalazine is classified as anti-infective, anti-inflammatory, and gastrointestinal. (C) Acemannan is classified as anti-infective, antineoplastic, and gastrointestinal. Multilabel predictions were also evaluated globally with network relationship analysis. True and predicted pairwise drug class memberships were used to generate class connections for drug class networks (Figure ). Most of the true relationships were recovered in the predicted relationships, but the model underestimated the strength of some relationships in the first validation fold (e.g., hematologic-cardio). The model also predicted that some new connections were missing from the true network, such as between the “respiratory” class and “urological” class. This global network analysis enabled a closer look at new connections between classes in the predicted set that were absent in the true class labels. For example, PubChem CID 121878 is classified as a cardiovascular drug, but the multiclass model predicts that this drug also functions on the respiratory system. PubChem suggests that this molecule is chemically related to cromakalim (Figure S6), which is a potassium channel modulating vasodilator known to act as a brochodilator.[62]

Figure 7

Analysis of multilabel drug classification using the IMG + CNN model. A network of relationships was computed from the (A) true class labels or (B) predicted class labels of the first validation fold. The width of each edge denotes the strength or frequency of co-occurrence. Red nodes indicate class grouping determined by their co-occurrence. Other examples are less obvious based on expert interpretation of the chemical structure. For example, model predictions on validation fold 5 suggest a weak connection between “reproductive control” and “hematologic” classes (Figure S7B) that is absent from the true class labels (Figure S7A). Closer inspection of the class predictions reveals that this was due to prediction of menadione (vitamin K3) as “reproductive control” in addition to its true label of hematologic. Although there is no obvious structural similarity between menadione (Figure S7C) and common reproductive control agents like estrogen (Figure S7D), a literature search reveals several connections that validate this prediction. Menadione is believed to activate extracellular signal-regulated kinases (ERK) and prevent inhibition of EGFR.[63] Estrogen, a reproductive control agent, also activates ERK signaling through transactivation of upstream EGFR.[64] In addition, both excess vitamin K and estrogen alter the risk of venous thrombosis.[65,66] Thus, the model has uncovered a functional relationship that is not apparent from a comparison of these two molecular structures. Altogether, these global analyses of drug function relationships helped reveal opportunities for drug repurposing.

Discussion

Here, we report two drug classification models that greatly exceed a previous benchmark on the same prediction task. Our models use molecular structures directly as inputs, whereas the previous study used alterations of the transcriptome as a proxy for molecules. The results presented here suggest that the experimental measurement of the influence of a molecule in biological systems may not be needed to accurately predict some types of chemical properties, such as annotated drug classes. However, we do not believe this necessarily means that direct empirical measure of the system is useless. Rather, additional research is required to determine which types of chemical prediction tasks require information about the biological state induced by chemicals and what type of biological state information is most useful (e.g., omic data, cellular morphology, etc.). There is likely room to improve effect-based models that would outperform molecular-structure-derived models, especially on more complex prediction tasks. Because the models presented here do not require empirical measurements of effects of chemicals, they can be broadly applied to predict drug classes after training; images and MFPs can be generated directly from the chemical structure for any chemical. A major limitation of the Aliper et al. featurization, or any empirically determined featurization, is that it requires new experiments for each new compound. This restricted the total data set of Aliper et al. to only 676 drugs for which transcriptomic data was available, thereby fundamentally limiting the utility of prediction to fewer compounds. Therefore, there must be a substantial improvement in predictive performance to justify the extra cost of using experimentally derived features in a virtual screening or chemical prediction setting. We propose that future studies on chemical prediction tasks that use empirically determined featurizations also use models that consider only chemical structure features as a baseline. Although there are several chemistry problems where DNNs outperform other shallow machine-learning methods,[49,67,68] here the MFP + RF performed best with the small data set of 676 molecules in the 5- and 12-class predictions. However, in the 3-class task with the small data set and all of the tasks with the large data set, the two models produced accuracies that were nearly indistinguishable. Because the performance of our two models was similar on the larger data set, our results suggest that the CNN has more difficulty learning many classes from a small amount of data. This highlights that, in general, more complex models should be benchmarked against strong standard machine-learning methods, especially when training data is limited. Much can be learned about chemical function from the cases where we find misclassification of chemical structures. We show cases where this can be rationalized by chemical properties of the molecules and cases where these properties that we often use to define the character of a chemical cannot explain the classification performance. In the latter case, this may mean that our models have learned something about chemistry that may not be recognized by chemists. Still, the class-specific differences in molecular properties are interesting to compare. Furthermore, when the models misclassify a structure, we can interpret this both as suggestive of a shared drug mechanism and as an opportunity for drug repurposing. Drug repurposing is an especially important aspect of this work because the application of an already approved drug is much less costly than de novo approval of a new chemical. Finally, we show how these models can be used for multilabel classification of drugs with multiple known functions and explore how incorrect predictions from multilabel classification can be used for drug repurposing. Our initial experiments followed the setting from Aliper et al. that excluded chemicals with multiple therapeutic uses, but we also extend the concept to a multilabel prediction model. The results show that our strategy is effective for the more complex multiclass prediction and that true relationships between drug classes are learned and recovered even with a relatively small data set of less than 10 000 molecules. We give examples of how these models learn to associate diverse chemistries with the same predicted function. In addition, we give examples where the literature supports misclassification, suggesting that these models may be useful in predicting off-target effects and repurposing opportunities. Taken together, our multilabel classification results prove the feasibility of this strategy with relatively simple modern deep-learning packages. Previous studies have demonstrated prediction of other types of drug classifications, for example, anatomical therapeutic chemical (ATC) codes (https://www.whocc.no/atc/structure_and_principles/).[69−72] Several types of models and chemical featurizations have been explored for ATC code prediction. Future work could apply the strategies described here for MeSH terms to predict ATC codes.

49 in total

1. Random forest: a classification and regression tool for compound classification and QSAR modeling.

Authors: Vladimir Svetnik; Andy Liaw; Christopher Tong; J Christopher Culberson; Robert P Sheridan; Bradley P Feuston
Journal: J Chem Inf Comput Sci Date: 2003 Nov-Dec

Review 2. Deep learning.

Authors: Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal: Nature Date: 2015-05-28 Impact factor: 49.962

3. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17.

Authors: Lars Ruddigkeit; Ruud van Deursen; Lorenz C Blum; Jean-Louis Reymond
Journal: J Chem Inf Model Date: 2012-11-01 Impact factor: 4.956

4. Deep Learning Based Regression and Multiclass Models for Acute Oral Toxicity Prediction with Automatic Chemical Feature Extraction.

Authors: Youjun Xu; Jianfeng Pei; Luhua Lai
Journal: J Chem Inf Model Date: 2017-10-27 Impact factor: 4.956

Review 5. Deep learning for computational chemistry.

Authors: Garrett B Goh; Nathan O Hodas; Abhinav Vishnu
Journal: J Comput Chem Date: 2017-03-08 Impact factor: 3.376

6. Estrogen-induced activation of Erk-1 and Erk-2 requires the G protein-coupled receptor homolog, GPR30, and occurs via trans-activation of the epidermal growth factor receptor through release of HB-EGF.

Authors: E J Filardo; J A Quinn; K I Bland; A R Frackelton
Journal: Mol Endocrinol Date: 2000-10

7. The Face of a Molecule.

Authors: William H Gerwick
Journal: J Nat Prod Date: 2017-09-08 Impact factor: 4.050

8. Practical Model Selection for Prospective Virtual Screening.

Authors: Shengchao Liu; Moayad Alnammi; Spencer S Ericksen; Andrew F Voter; Gene E Ananiev; James L Keck; F Michael Hoffmann; Scott A Wildman; Anthony Gitter
Journal: J Chem Inf Model Date: 2018-12-18 Impact factor: 4.956

Review 9. Opportunities and obstacles for deep learning in biology and medicine.

Authors: Travers Ching; Daniel S Himmelstein; Brett K Beaulieu-Jones; Alexandr A Kalinin; Brian T Do; Gregory P Way; Enrico Ferrero; Paul-Michael Agapow; Michael Zietz; Michael M Hoffman; Wei Xie; Gail L Rosen; Benjamin J Lengerich; Johnny Israeli; Jack Lanchantin; Stephen Woloszynek; Anne E Carpenter; Avanti Shrikumar; Jinbo Xu; Evan M Cofer; Christopher A Lavender; Srinivas C Turaga; Amr M Alexandari; Zhiyong Lu; David J Harris; Dave DeCaprio; Yanjun Qi; Anshul Kundaje; Yifan Peng; Laura K Wiley; Marwin H S Segler; Simina M Boca; S Joshua Swamidass; Austin Huang; Anthony Gitter; Casey S Greene
Journal: J R Soc Interface Date: 2018-04 Impact factor: 4.293

10. PubChem Substance and Compound databases.

Authors: Sunghwan Kim; Paul A Thiessen; Evan E Bolton; Jie Chen; Gang Fu; Asta Gindulyte; Lianyi Han; Jane He; Siqian He; Benjamin A Shoemaker; Jiyao Wang; Bo Yu; Jian Zhang; Stephen H Bryant
Journal: Nucleic Acids Res Date: 2015-09-22 Impact factor: 16.971

12 in total

1. Novel Development of Predictive Feature Fingerprints to Identify Chemistry-Based Features for the Effective Drug Design of SARS-CoV-2 Target Antagonists and Inhibitors Using Machine Learning.

Authors: Kelvin Cooper; Christopher Baddeley; Bernie French; Katherine Gibson; James Golden; Thiam Lee; Sadrach Pierre; Brent Weiss; Jason Yang
Journal: ACS Omega Date: 2021-02-05

2. Revealing new therapeutic opportunities through drug target prediction: a class imbalance-tolerant machine learning approach.

Authors: Siqi Liang; Haiyuan Yu
Journal: Bioinformatics Date: 2020-08-15 Impact factor: 6.937

3. A Genetic Neural Net Model for the Relationship between Pre-School and Attention in Early Childhood.

Authors: Liping Wang; Na Yao
Journal: Comput Intell Neurosci Date: 2022-06-06

4. Chemical Class Prediction of Unknown Biomolecules Using Ion Mobility-Mass Spectrometry and Machine Learning: Supervised Inference of Feature Taxonomy from Ensemble Randomization.

Authors: Jaqueline A Picache; Jody C May; John A McLean
Journal: Anal Chem Date: 2020-07-23 Impact factor: 8.008

5. Topological Distance-Based Electron Interaction Tensor to Apply a Convolutional Neural Network on Drug-like Compounds.

Authors: Hyun Kil Shin
Journal: ACS Omega Date: 2021-12-15