Literature DB >> 24010729

Using molecular features of xenobiotics to predict hepatic gene expression response.

Abstract

Despite recent advances in molecular medicine and rational drug design, many drugs still fail because toxic effects arise at the cellular and tissue level. In order to better understand these effects, cellular assays can generate high-throughput measurements of gene expression changes induced by small molecules. However, our understanding of how the chemical features of small molecules influence gene expression is very limited. Therefore, we investigated the extent to which chemical features of small molecules can reliably be associated with significant changes in gene expression. Specifically, we analyzed the gene expression response of rat liver cells to 170 different drugs and searched for genes whose expression could be related to chemical features alone. Surprisingly, we can predict the up-regulation of 87 genes (increased expression of at least 1.5 times compared to controls). We show an average cross-validation predictive area under the receiver operating characteristic curve (AUROC) of 0.7 or greater for each of these 87 genes. We applied our method to an external data set of rat liver gene expression response to a novel drug and achieved an AUROC of 0.7. We also validated our approach by predicting up-regulation of Cytochrome P450 1A2 (CYP1A2) in three drugs known to induce CYP1A2 that were not in our data set. Finally, a detailed analysis of the CYP1A2 predictor allowed us to identify which fragments made significant contributions to the predictive scores.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Substances：

Year: 2013 PMID： 24010729 PMCID： PMC3810861 DOI： 10.1021/ci3005868

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

The liver response to a drug is critical in determining the ultimate effect the drug will have on the body. It is well-known that the first-pass effect of the cytochrome P450s, metabolizing enzymes, and transporters can greatly reduce the bioavailability of a drug or transform a prodrug into its active form.[1] Subsequent metabolic processes often eliminate drugs from the body either exclusively by the liver or in conjunction with the kidney. Because the liver performs these critical roles in processing xenobiotics, the liver response must be considered when determining drug doses or drug combinations to ensure that toxic levels of chemical species do not accumulate in the body and lead to adverse drug reactions.[2] Gene expression response is a well-known way to measure and quantify the liver’s response to xenobiotic stimulus. However, the mechanism by which a drug will lead to a change in gene expression is not fully understood. In this work we were interested in determining which genes have their expression predictably changed in the liver directly in response to small molecules. In particular, we used publicly available data sets of drug and liver response data to seek genes whose expression was greatly affected by specific chemical features of small molecules. Molecular fingerprints provide an efficient method to characterize a chemical as a set of molecular features represented by unique identifiers.[3] There are many varieties of fingerprinting methods used for similarity searching or virtual screening of large chemical libraries.[4] Extended connectivity fingerprints (ECFP) are based on chemical bond topology and capture features relevant to molecular activity.[5] They have been successfully applied for predicting chemical activities, even among structurally diverse compounds.[6,7] ECFP4 fingerprints generate unique identifiers for topological fragments that contain up to four bonds and are among the highest performing fingerprints for identifying similar molecules with known activities.[8] Gene expression microarrays enable the simultaneous measurement of tens of thousands RNA expression probes in a tissue and have been used to detect significant differences between healthy and diseased tissues.[9] Gene expression experiments have also been used to detect significant RNA expression responses in tissues that have been treated with drugs.[10] For example, the Connectivity Map data set contains gene expression measurements on 1309 compounds and has been used in drug repositioning[11,12] and for elucidating the mechanism of action of drugs.[13] The DrugMatrix database contains RNA expression data from approximately 600 different compounds given in vivo to rats at different doses and time points and then measured on seven different tissues. These DrugMatrix data have been used to study drug toxicology and liver response profiles.[14−17] Others have connected drug structure to gene expression but focused on a large database of predefined chemical structures and gene expression in cancer cell lines.[18] In this study, we sought to connect the molecular level information contained in chemical fingerprints to the cellular level measures in the drug-induced liver gene expression data from DrugMatrix. In particular, we were interested in finding genes in the liver that are predictably up-regulated in response to small molecules, described using chemical fingerprints. A model by which a drug elicits changes in gene expression is illustrated in Figure 1. In this model a drug binds to a receptor, such as an allosteric site on a cell surface or a cytosolic protein, which triggers a series of signaling events. The signals reach the nucleus and trigger nuclear responses, which lead to transcriptional changes. The specific interactions and pathways that lead to these changes in transcription are largely unknown, and the elucidation of these pathways is an active area of research.[19−21] Therefore, any attempt to predict expression changes using existing pathways is necessarily based on incomplete information. Our approach circumvents this problem by finding direct relationships between chemical fingerprint features and gene expression changes, as also illustrated in Figure 1.

Figure 1

Drugs are shown schematically on the outside of the cell with different chemical features (green square, red triangle, blue half-circle). These features may lead to binding at one or more receptors that begin a signaling cascade ultimately leading to gene expression changes. Gene transcripts as shown as circles, where each circle is drawn in the color (or colors) associated with the chemical features associated with its expression. The work presented here focuses on finding the association of chemical features to the genes whose expression is directly predictable from a subset of these chemical features alone. The method presented here has three parts: (1) we generate a reduced feature space to describe the drugs; (2) we use machine learning techniques to create and validate classifiers which predict if a gene will be up-regulated by a drug; and (3) we analyze the features of the classifiers to describe the drugs which are predicted to up-regulate the genes. To make our reduced feature space, we generated an matrix of fingerprints from a compendium of drugs in DrugBank[22] and then transformed the feature vectors into a reduced space using the twenty largest principal components of the matrix. We then applied machine learning algorithms to determine if any of the genes which were up-regulated at least 1.5 fold times in the liver could be predicted based on these features. Surprisingly, we found 87 genes that could be reliably predicted using only fingerprint information. We validated these predictions internally by cross validation and in an independent data set measuring the response to pregnenolone 16alpha-carbonitrile, a steroid hormone that was not included in any of our data sets. We also validated our predictive model for cytochrome P450 1A2 (CYP1A2) in three known CYP1A2 inducers that were not present in our data set. Finally, we analyzed the CYP1A2 model to highlight those fragments that are most informative for determining the predictive scores.

Methods

Gene Expression Data Source

We downloaded the normalized Affymetrix whole genome 230 2.0 rat GeneChip array data liver gene expression from the DrugMatrix database[14] and filtered the experiments for probes which had a significant change in expression (p < 0.05) and at least a 1.5 fold increase in expression for at least 20 different drugs at any dose or time period and a 1.1 fold increase or lower for at least 20 different drugs. We empirically chose a cutoff of 1.5 for up-regulated genes because we wanted to make predictions on genes with strong signals, but we also needed to keep a sufficient number of genes in the data set. We chose a cutoff of 1.1 for down regulated genes to reduce the possibility of using false negatives in our training data. The resulting data set contains 170 distinct drugs and 3830 distinct probes. We converted this data set into a 170 × 3830 matrix where each entry has a value of 1 for probes having greater than 1.5 fold expression, a value of −1 for probes with less than 1.1 fold expression, and a value of 0 otherwise.

Molecular Feature Space of Chemicals

We matched 4069 drugs from DrugBank to molecules in ChEMBL[23] using drug names and synonyms and downloaded the canonical SMILES strings from ChEMBL. We then used these SMILES strings to generate extended connectivity fingerprint identifiers with a diameter of four atoms (ECFP4) using JChem Base 5.11.5, 2013, ChemAxon (www.chemaxon.com). Two-dimensional ECFP4 fingerprints are effective in recalling compound activities and are frequently used in QSAR and QSPR models.[24] Each identifier represents a topological substructure feature of a molecule. This resulted in a total of 19 810 distinct structure identifiers used to describe 4069 DrugBank drugs. Supporting Information Figure 3 shows a histogram of the drug counts per ECFP4. The histogram shows that there is an exponential drop off in the number of drugs per ECFP4 feature, with a few features that are nearly ubiquitous and many more that occur in only a handful of drugs. We created a binary feature matrix with 4069 rows (one for each drug) and 19 810 columns (one for each ECFP4 identifier). An entry i, j in the binary feature matrix was set to 1 if and only if drug i contained feature j; otherwise, the entry was 0. We then performed principal components analysis of this 4069 × 19 810 matrix using the prcomp function in the stats package in R.[25] We used the first 20 principal component loadings as the feature representation for chemicals.

Feature Representation of DrugMatrix Chemicals

We matched 170 drug names from the DrugMatrix data set to molecules in ChEMBL by matching drug names with molecule names and synonyms, downloaded their canonical SMILES strings, and generated ECFP4 fingerprints for each of these chemicals, resulting in 2372 distinct ECFP4 identifiers. We then used these 2372 identifiers to generate a 170 × 19 810 binary feature matrix where an entry i, j is was set to 1 if and only if the DrugMatrix drug i contained the ECFP4 identifier j, using only the DrugBank identifiers. If an ECFP4 identifier was present in the set of identifiers from DrugMatrix but not in the set from DrugBank, it was not used. We then projected this 170 × 19 810 matrix onto the first 20 components of the DrugBank PCA, resulting in a 170 × 20 feature matrix.

Correlation of Expression Similarity and Chemical Similarity

We computed the pairwise Tanimoto similarity of all 170 drugs using ECFP4 fingerprints and the pairwise correlation of the expression values of the 3830 probes used for training. We fit a linear model using the lm function in the R base package with the similarity in gene expression as the outcome variable and the similarity of the drugs as the predictor variable.[25]

Machine Learning Approach

We generated 10 different training and evaluation data sets by randomly choosing 80% of the 170 drugs for training (n = 136) and 20% of the drugs for evaluation (n = 34) for each of the 10 iterations. We then performed machine learning on each of the 10 training data sets with the following process: Generate a drug × PCA loading feature matrix. We selected drugs for which there were at least 20 positives (expression > 1.5) and 20 negatives (expression < 1.1), in our expression data set. For each of the 3830 probes, generate 25 bootstrap samples and use L1 constrained logistic regression models by training on 80% of the drugs (a 109 × 20 matrix), chosen randomly for each iteration, and measuring the area under the receiver operating characteristic curve (AUROC) using the remaining 20% of the drugs (n = 27). These AUROC values serve as a performance metric for how well the models classify each of the probes as up 1.5 fold. The L1 logistic regression models were generated using the cv.glmnet function from the glmnet package in R.[26,27] We set the parameters of the cv.glmnet function to perform cross-validation using 10 folds and choose a value of that minimized the error within one standard error of the minimum, as recommended to avoid overfitting.[26] We calculated ROC curves and AUROC values using the ROCR package in R.[27] We performed an additional 10 iterations of this machine learning approach with the positive and negative labels shuffled on the training data, resulting in baseline performance metrics for 100 randomly labeled training sets. The metrics from these values indicate what performance we would expect to see at random.

Machine Learning Evaluation

We calculated and plotted the mean and standard deviation of the AUROC values across the 25 iterations for all of the probe classifiers within each of the 10 training sets. A comparison with the randomly shuffled results indicated that any classifier with a mean AUROC ≥ 0.534 had significant performance, and classifiers with a mean AUROC ≥ 0.7 were 1 standard error above the 0.534 cutoff. We therefore selected all classifiers which had a mean AUROC ≥ 0.7 in any of the training sets and evaluated the performance of the models by generating an AUROC using the evaluation drugs (n = 34) for each iteration. If a classifier had a mean AUROC ≥ 0.7 in more than one training set, we tested it in each of the models.

External Validation

We downloaded the GSE 4959 data set[28] from the Gene Expression Omnibus and analyzed the pregnenolone 16alpha-carbonitrile treated versus control samples using the GEO2R tool.[29] This data set measures the rat liver response to pregnenolone 16alpha-carbonitrile, a drug that is not present in the original data set. We chose all gene expression classifiers that had a mean AUROC ≥ 0.7 in four or more of the iterations (n = 87) for external validation and generated complete models using all 170 drugs to train the L1 constrained logistic regression. We then generated the ECFP4 features for pregnenolone 16alpha-carbonitrile and then projected those features into the top 20 PCA components using the PCA rotation matrix from the DrugBank data set. We matched 62/87 of the probes from GSE4959 to probes in our data set using Affymetrix identifiers. Finally, we scored each of these 62 probes and normalized their scores between 0 and 1 to generate a single ROC curve. We computed the Tanimoto similarity of pregnenolone 16alpha-carbonitrile to all 170 molecules in the data set to determine its maximum similarity. We also computed the Tanimoto similarity of the gene expression signature sets consisting of all genes that were up-regulated at least 1.5 fold in the presence of pregnenolone 16alpha-carbonitrile versus the expression signatures for all 170 molecules from the DrugMatrix data set.

Validation and Analysis of CYP1A2 Classifier

The best performing classifier in our data set was for a transcript encoding Cytochrome P450 1A2 (CYP1A2), an important enzyme involved in the metabolism of xenobiotics. We searched the PubMed database with the terms “CYP1A2” and “induction” and identified three small molecules that have been shown to induce expression of CYP1A2, listed in Table 1, which were not in our data set. We generated the ECFP4 fingerprints for these drugs and projected the resulting feature vectors into our feature space using the top 20 principal components of the PCA rotation matrix from the DrugBank data set. We scored these molecules using the classifier to determine if they were correctly predicted to induce expression of CYP1A2.

Table 1

Three Drugs Known to Induce Expression of CYP1A2 and Their Predicted Score Using the CYP1A2 Classifiera

drug name	reference	max Tanimoto similarity	score
omeprazole	Rost et al. (1992)[33]	0.46	1.59
3,3-diindolylmethane	Lake et al. (1998)[34]	0.29	1.28
RO4938581	Bundgaard et al. (2013)[35]	0.14	0.31

Also shown is the maximum Tanimoto similarity to any of the 170 drugs in the data set. The similarity score was calculated using ECFP4 fingerprints.

Also shown is the maximum Tanimoto similarity to any of the 170 drugs in the data set. The similarity score was calculated using ECFP4 fingerprints. To analyze the features used to predict up-regulation of CYP1A2, we took the complete model for CYP1A2 and computed a weight vector , using the following equation:where {CYP1A2 PCs} is the set of principal components chosen by the L1 constrained logistic regression to have nonzero beta coefficients and each β is the beta coefficient for the corresponding principal component. The resulting vector is a vector of 19 810 real values, {w1, ..., w19 810} where each value w represents the contribution that the presence of ECFP4 makes to the score of any given molecule. We tested for enrichment of each of the ECFP4s in the true positives or true negatives using Fisher’s Exact test with multiple hypothesis testing correction using the false discovery rate.[30] We checked for the presence of the ECFP4 fragments with the largest positive and negative weights using ChemAxon’s “jcsearch” program and generated visualizations of the fragments with ChemAxon’s “mview” program.

Results

Gene Expression

The Affymetrix Rat 230 2.0 GeneChip data set we used contains multiple time points and multiple doses for some drugs (657 distinct drug-dose combinations in rat liver tissue with measurements on 31 042 probes). For the purpose of this work we were interested in building classifiers, which requires choosing cutoffs to determine which genes were true positives (up-regulated) and which were true negatives (not up-regulated). In order to avoid potential mislabeling of probes we focused on the more extreme signals by labeling probes with fold changes of 1.5 or greater as true positives (up-regulated) and probes with fold changes of 1.1 or less as true negatives (not up-regulated). Any gene measurements in between these two values were not considered in this analysis. In order to ensure the data set had a sufficient number of positive and negative examples we filtered for probes that were up-regulated at least 1.5 fold in the presence of at least 20 distinct drugs and 1.1 fold or less for at least 20 distinct drugs. After this filtering we were left with 3830 probes. The distribution of fold change values is shown in Figure 2.

Figure 2

Distribution of fold change values for transcripts up-regulated by drugs. The area highlighted in cyan shows the expression values which are at least 1.5 fold and were used as positive examples (n = 126 040); the area highlighted in orange shows expression values less than or equal to 1.1 times and were used as negative examples (n = 41 185). The gray area shows probes with expression between 1.1 and 1.5 (n = 198 266), which were not used.

Molecular Descriptors for Drugs

The principal components analysis of the DrugBank ECFP4 matrix (4069 drugs × 19 810 ECFP4 identifiers) resulted in a projection of the drugs into a reduced feature space that describes most of medically relevant chemical space. A scree plot of the first 100 components for this projection (Figure 3) shows that there is a considerable drop in variance explained after the first few components, followed by a gradual drop off. We projected the ECPF4 features into the top 20 principle components and used those 20 principal component loadings as the features to describe the drugs. Taken together these 20 components explain 30% of the variance in the extended connectivity fingerprints for 4069 drugs taken from DrugBank. This is a considerable reduction in feature space and leaves a significant proportion of the variance unexplained. However, it focuses our search on molecular fragments that occur in multiple drugs and greatly reduces the chance of over fitting our classifiers. Using this projection we were able to reduce our feature set from 2372 distinct ECFP4 identifiers to 20 loadings on the top 20 principal components, resulting in a 170 drug by 20 principal component loading matrix.

Figure 3

Scree plot of first 100 principal components of extended connectivity fingerprint identifiers for 4069 drugs from DrugBank. Taken together these components essentially describe all current pharmacologically relevant chemical fingerprint space. We chose the first 20 components, delimited by the vertical line, to represent the features of drugs, which accounts for 30% of the variance in the DrugBank ECFP4 (4069 drug × 19 810 ECFP4) matrix.

Correlation of Expression Similarity and Chemical Similarity

Interestingly, the Tanimoto similarity of the drugs was not correlated with similarity in gene expression. The correlation of between similarity in structure and similarity in expression was 0.10.

Performance of Classifiers

Figure 4A plots the mean AUROC of each gene in one training set and shows that most genes did not perform well and have a mean AUROC at or near 0.5, which indicates that their performance is no better than random. However, some classifiers performed well in many iterations and have performance considerably greater than 0.5.

Figure 4

Plot of the mean area under receiver operating characteristic curve (AUROC) for classifiers for each of 3830 genes sorted in increasing order for real data (A) and randomized labels (B). The dotted red line indicates 0.5, where performance is equivalent to random chance. The solid red line indicates a mean AUROC of 0.7. Classifiers above the red line will rank a randomly chosen true positive greater than a randomly chosen true negative 70% of the time. (A) Larger number of high scoring genes. (B) Only 5% of genes with randomly permuted labels have a mean AUROC > 0.534, indicated by the purple line. The results of our permutation analysis are shown in Figure 4B. A mean AUROC > 0.534 would occur by chance only 5% of the time, giving an estimated p-value, p̂ ≤ 0.05. These genes are highlighted in purple in Figure 4A and B. With this estimated p-value there are 1198 probe classifiers in Figure 4A with significant performance. To identify the strongest signals, we focused on the probes with an even greater mean AUROC ≥ 0.7, and in the case of the data shown in Figure 4A, there 171 of these genes. All 10 of the summary plots for the data sets are provided in Supporting Information Figure 1. The performance of the probe classifiers also varied depending upon which of the 10 training and evaluation sets we used. Some probe classifiers had performance ≥0.7 in just one of the 10 iterations and some performed well in all 10 iterations. Figure 5 shows box plots summarizing the mean AUROCs for probe classifiers that have mean AUROC ≥ 0.7 in one or more of the 10 training data sets. As expected, the mean performance of the probe classifiers increases as they occur in the top results of more data sets. The mean performance of probe classifiers which performed well in only one or two data sets is near 0.5, but if a probe classifier performed well in four or more data sets then the mean AUROC > 0.6 across all 10 data sets. To account for this variation we chose to use the 87 probe classifiers that were consistently up in four or more data sets for external validation.

Figure 5

Box plots summarizing results from 10 randomly sampled training (n = 136) and test (n = 34) data sets from DrugMatrix. Scores of all probe classifiers with an AUROC ≥ 0.7 in at least one of the 10 data sets is included. As the number of data sets in which probe performs well increases the mean AUROC increases. We chose the probes that had a mean AUROC ≥ 0.7 in four or more data sets for external validation. The maximum Tanimoto similarity of pregnenolone 16alpha-carbonitrile to any of the 170 drugs from DrugMatrix was 0.3286. The maximum Tanimoto similarity of the expression profile of genes up-regulated at least 1.5 fold was 0.389. The ROC curve for predicting the rat liver gene expression response to pregnenolone 16alpha-carbonitrile in the 62 chosen classifiers is shown in Figure 6 and shows an AUROC of 0.7. The list of 62 probes used to generate this ROC curve is shown in Supporting Information Table 1.

Figure 6

Receiver operating characteristic (ROC) curve for 62 probes from GSE4959, measuring the rat liver response to pregnenolone 16alpha-carbonitrile. We matched 62 of the 87 probe classifiers to probes in GSE4959. We normalized the scores generated by the classifiers between 0 and 1 and then used the true positive labels to generate the ROC curve. With a false positive rate of 0.2 the true positive rate is 0.67, indicated by the red circle. The area under this ROC curve is 0.7, which is in alignment with the results from our cross-validation.

Validation and Analysis of CYP1A2 Classifier

The probe with the best predictive classifier performance was 1387243_at, which is a transcript encoding Cytochrome P450 1A2 (CYP1A2). The classifier for this probe had a mean AUROC of 0.86 with a standard error of 0.09 across all 10 of the training and evaluation sets. We scored three molecules, shown in Table 1, which are known to induce the expression on CYP1A2 using this classifier and were not present in our data set. None of molecules were significantly similar to any other molecules in our data set, the maximum Tanimoto similarity of the molecules to any of the 170 drugs ranged from 0.14 to 0.46. The three known CYP1A2 inducing molecules that we identified, omeprazole, 3,3-diindolylmethane, and RO4938581, scored within the range of scores for the true positive drugs, as shown in Figure 8. Using the weight vector , as described in the methods, we identified the ECFP4 identifiers that contributed the five largest positive and five largest negative weights to each of the CYP1A2 validation drugs. Supporting Information Figure 2 shows each of the drugs and highlights the positive and negative features associated with the ECFP4 identifiers.

Figure 8

Histogram showing scores of true negative drugs that do not up-regulate CYP1A2 at least 1.5 fold (in orange) and scores for true positive drugs (in cyan) that do up-regulate CYP1A2. The scores of three known CYP1A2 inducers (from Table 1) are annotated on the histogram. The three drugs score well within the range of the true positives.

Histogram showing the frequency of ECFP4 identifiers that are enriched in either the true positive drugs, which up-regulate CYP1A2 (in cyan) and true negative drugs, which do not up-regulate CYP1A2 (in orange). A fragment is considered significantly enriched if the adjusted p-value for enrichment in the set using Fisher’s exact test was less than 0.1. The weight for each ECFP4 is shown above each fragment, indicating how much the presence of each identifier contributes to the score for a given drug. Histogram showing scores of true negative drugs that do not up-regulate CYP1A2 at least 1.5 fold (in orange) and scores for true positive drugs (in cyan) that do up-regulate CYP1A2. The scores of three known CYP1A2 inducers (from Table 1) are annotated on the histogram. The three drugs score well within the range of the true positives.

Discussion

We have shown that there is a subset of genes in the liver that can reliably be predicted to be up-regulated based on the chemical information found in ECFP4s. Because there is minimal correlation between the ECFP4 fingerprint similarities of the drugs and the gene expression that the drugs induce (ρ = 0.10) the connections between ECFP4 and gene expression must be made through the presence of coordinated subsets of ECFP4 identifiers rather than overall chemical similarity. We created our classifiers using a drastically reduced set of features generated by projecting our data into the first 20 principal components from an ECFP4 matrix consisting of all the ECFP4 identifiers generated from the drugs in DrugBank. These 20 principal components only capture 30% of the variance in the DrugBank ECFP4 matrix and many ECFP4 fragments in the DrugMatrix data set are not represented in the feature vectors. By using this reduced feature space the classifiers were forced to find combinations of principal components that best predict up-regulation of the genes. For the majority of the genes the classifiers were unable to make accurate predictions. However, 87 genes were reliably predictable across multiple bootstrap samples including CYP1A2, which we were able to validate in three drugs known to induce CYP1A2. Using a deeper analysis of the CYP1A2 classifier we identified which ECFP4 features of these drugs made the largest contributions to their positive and negative scores.

Strengths and Weaknesses

There are many concerns when evaluating a machine learning experiment with large feature sets, such as overfitting, validation, and interpretation. In this study we addressed these concerns at each stage of our experiment: (1) generation of our drug feature space; (2) machine learning and validation; and (3) analysis of features.

Generation of Feature Space

Rather than generate a feature space based on the only drugs in the training data set we used 4069 drugs from DrugBank, which resulted in 19 810 ECFP4s. Given that DrugBank contains FDA approved drugs, experimental drugs, and nutraceuticals, this set of ECFP4s describes nearly all known features relevant to known drug based medical therapy. Even so, the ECFP4s in the training data included 501 ECFP4s that were not in the DrugBank set because not all drugs and drug variants in the training data are present in DrugBank. We then transformed the 170 feature vectors for the training drugs into the top 20 PCA components of the DrugBank ECFP4 feature matrix. These top 20 features describe 30% of the variance in the DrugBank features. Given this vast reduction in feature space and by using the additional feature selection embedded in L1-regularized regression, we eliminated any concerns of over fitting. With larger high-quality expression data sets for more drugs it would be possible to use more PCA components that would explain more of the variance.

Machine Learning

In order to address statistical significance in our machine learning approach we used repeated bootstrap sampling in order to find predictive models that were useful in repeated data sets. Randomized labels resulted in a mean AUROC greater than 0.534 in only the top 5% of cases (Figure 4B), indicating that a mean AUROC > 0.534 will occur by chance only 5% of the time. By only considering classifiers with a mean AUROC > 0.7 in several bootstrap samples, we ensure that we are not selecting classifiers which are effective only due to chance or which are only effective on the training data. The validation of these classifiers in the external pregnenolone 16alpha-carbonitrile data set (Figure 6) further affirms the reliability of the models.

Analysis of Features

A benefit of this approach is that it allows for interpretation of the significant features used for prediction. Figure 7 shows the most significant ECFP4s for predicting if CYP1A2 is up-regulated, the frequency in which they appear in the true positives and true negatives, and the weight that each ECFP4 identifier contributes to the score. Five ECFP4 identifiers were enriched in the true positive drugs and four ECFP4 identifiers were enriched in the true negative drugs, using Fisher’s exact test (p < 0.1). It is important to recognize that even if an individual feature is enriched in the true positives or true negatives, that feature alone does not determine the score. As seen in Figure 7, there may be similarities and even overlap between the features as well as correlations between the ECFP4s that tend to appear together in drugs. The total score for a given drug comes from the sum of all the weights for the features present in the drug, including the overlapping and correlated features. Despite the limitations in looking at just a few chemical features, it is informative to look at the features that make the largest contributions to the score for a drug. The three validation drugs, known to induce expression of CYP1A2, are shown in Supporting Information Figure 2 in three panels that highlight the features that have the largest positive and largest negative weights. It is interesting to note how the features tend to overlap over common regions of the molecular graphs so that a given portion of a molecule may contribute both positive and negative weights. However, the net contribution of those overlapping features results in either a positive or negative change to the score due to the differences in the size of the values.

Figure 7

Relationship to Previous Work

Others have shown that chemical structure can be related to gene expression. For example, Blower et al. calculated correlations between the structure of cancer drugs and gene expression in NCI-60 cell lines.[18] In that work the authors were able to successfully construct substructure queries for chemical classes that induced similar gene expression sets. However, that work was focused on cancer cell lines and entire expression signatures, instead of individual gene classifiers based on structural features. ECFPs have been used to predict activity classes; Heikamp et al. showed that feature selection on ECFPs with gain-ratio analysis could be used to create reduced fingerprint feature sets for searching activity classes, suggesting that such feature sets could be used for scaffold hopping.[6,31] Similarly, the work presented here uses machine learning and feature selection in order to find sets of features that are predictive of an increase in gene expression. Bender et al. used chemical structure to predict ligand binding in a selection of targets associated with adverse events to successfully develop predictive models for adverse events.[32] We anticipate that additional training data would further increase the number of genes that could be reliably predicted to have increased expression. Given a large enough set of genes, there would be intriguing drug development applications such as identifying drugs or combinations of drugs with increased likelihood of adverse or toxic effects.

31 in total

1. Functional discovery via a compendium of expression profiles.

Authors: T R Hughes; M J Marton; A R Jones; C J Roberts; R Stoughton; C D Armour; H A Bennett; E Coffey; H Dai; Y D He; M J Kidd; A M King; M R Meyer; D Slade; P Y Lum; S B Stepaniants; D D Shoemaker; D Gachotte; K Chakraburtty; J Simon; M Bard; S H Friend
Journal: Cell Date: 2000-07-07 Impact factor: 41.582

2. Enzyme-catalyzed processes of first-pass hepatic and intestinal drug extraction.

Authors:
Journal: Adv Drug Deliv Rev Date: 1997-09-15 Impact factor: 15.470

3. Development of a large-scale chemogenomics database to improve drug candidate selection and to understand mechanisms of chemical toxicity and action.

Authors: Brigitte Ganter; Stuart Tugendreich; Cecelia I Pearson; Eser Ayanoglu; Susanne Baumhueter; Keith A Bostian; Lindsay Brady; Leslie J Browne; John T Calvin; Gwo-Jen Day; Naiomi Breckenridge; Shane Dunlea; Barrett P Eynon; L Mike Furness; Joe Ferng; Mark R Fielden; Susan Y Fujimoto; Li Gong; Christopher Hu; Radha Idury; Michael S B Judo; Kyle L Kolaja; May D Lee; Christopher McSorley; James M Minor; Ramesh V Nair; Georges Natsoulis; Peter Nguyen; Simone M Nicholson; Hang Pham; Alan H Roter; Dongxu Sun; Siqi Tan; Silke Thode; Alexander M Tolley; Antoaneta Vladimirova; Jian Yang; Zhiming Zhou; Kurt Jarnagin
Journal: J Biotechnol Date: 2005-09-29 Impact factor: 3.307

4. Analysis of pharmacology data and the prediction of adverse drug reactions and off-target effects from chemical structure.

Authors: Andreas Bender; Josef Scheiber; Meir Glick; John W Davies; Kamal Azzaoui; Jacques Hamon; Laszlo Urban; Steven Whitebread; Jeremy L Jenkins
Journal: ChemMedChem Date: 2007-06 Impact factor: 3.466

5. How do 2D fingerprints detect structurally diverse active compounds? Revealing compound subset-specific fingerprint features through systematic selection.

Authors: Kathrin Heikamp; Jürgen Bajorath
Journal: J Chem Inf Model Date: 2011-08-08 Impact factor: 4.956

6. Regularization Paths for Generalized Linear Models via Coordinate Descent.

Authors: Jerome Friedman; Trevor Hastie; Rob Tibshirani
Journal: J Stat Softw Date: 2010 Impact factor: 6.440

Review 7. Investigations toward enhanced understanding of hepatic idiosyncratic drug reactions.

Authors: Michael J Liguori; Jeffrey F Waring
Journal: Expert Opin Drug Metab Toxicol Date: 2006-12 Impact factor: 4.481

8. 3,3'-Diindolylmethane induces CYP1A2 in cultured precision-cut human liver slices.

Authors: B G Lake; J M Tredger; A B Renwick; P T Barton; R J Price
Journal: Xenobiotica Date: 1998-08 Impact factor: 1.908

9. Increase of cytochrome P450IA2 activity by omeprazole: evidence by the 13C-[N-3-methyl]-caffeine breath test in poor and extensive metabolizers of S-mephenytoin.

Authors: K L Rost; H Brösicke; J Brockmöller; M Scheffler; H Helge; I Roots
Journal: Clin Pharmacol Ther Date: 1992-08 Impact factor: 6.875

10. NCBI GEO: archive for functional genomics data sets--10 years on.

Authors: Tanya Barrett; Dennis B Troup; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Rolf N Muertter; Michelle Holko; Oluwabukunmi Ayanbule; Andrey Yefanov; Alexandra Soboleva
Journal: Nucleic Acids Res Date: 2010-11-21 Impact factor: 16.971

2 in total

1. Drugs that reverse disease transcriptomic signatures are more effective in a mouse model of dyslipidemia.

Authors: Allon Wagner; Noa Cohen; Thomas Kelder; Uri Amit; Elad Liebman; David M Steinberg; Marijana Radonjic; Eytan Ruppin
Journal: Mol Syst Biol Date: 2015-03 Impact factor: 11.429

2. Relating Chemical Structure to Cellular Response: An Integrative Analysis of Gene Expression, Bioactivity, and Structural Data Across 11,000 Compounds.

Authors: B Chen; P Greenside; H Paik; M Sirota; D Hadley; A J Butte
Journal: CPT Pharmacometrics Syst Pharmacol Date: 2015-09-29

2 in total