Andrew D Rouillard, Mark R Hurle, Pankaj Agarwal.
Abstract
Target selection is the first and pivotal step in drug discovery. An incorrect choice may not manifest itself for many years after hundreds of millions of research dollars have been spent. We collected a set of 332 targets that succeeded or failed in phase III clinical trials, and explored whether Omic features describing the target genes could predict clinical success. We obtained features from the recently published comprehensive resource: Harmonizome. Nineteen features appeared to be significantly correlated with phase III clinical trial outcomes, but only 4 passed validation schemes that used bootstrapping or modified permutation tests to assess feature robustness and generalizability while accounting for target class selection bias. We also used classifiers to perform multivariate feature selection and found that classifiers with a single feature performed as well in cross-validation as classifiers with more features (AUROC = 0.57 and AUPR = 0.81). The two predominantly selected features were mean mRNA expression across tissues and standard deviation of expression across tissues, where successful targets tended to have lower mean expression and higher expression variance than failed targets. This finding supports the conventional wisdom that it is favorable for a target to be present in the tissue(s) affected by a disease and absent from other tissues. Overall, our results suggest that it is feasible to construct a model integrating interpretable target features to inform target selection. We anticipate deeper insights and better models in the future, as researchers can reuse the data we have provided to improve methods for handling sample biases and learn more informative features. Code, documentation, and data for this study have been deposited on GitHub at https://github.com/arouillard/omic-features-successful-targets.
Year: 2018 PMID: 29782487 PMCID: PMC5983857 DOI: 10.1371/journal.pcbi.1006142
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Datasets tested for features significantly separating successful targets from failed targets.
| Dataset | Feature Type | Total Genes | Covered Samples | Total Features | Covered Features | Reduced Features |
|---|---|---|---|---|---|---|
| Roadmap Epigenomics Cell and Tissue DNA Methylation Profiles | cell or tissue DNA methylation | 13835 | 227 | 26 | 26 | 4 |
| Allen Brain Atlas Adult Human Brain Tissue Gene Expression Profiles | cell or tissue expression | 17979 | 287 | 416 | 416 | 2 |
| Allen Brain Atlas Adult Mouse Brain Tissue Gene Expression Profiles | cell or tissue expression | 14248 | 287 | 2234 | 2234 | 2 |
| BioGPS Human Cell Type and Tissue Gene Expression Profiles | cell or tissue expression | 16383 | 320 | 86 | 86 | 2 |
| BioGPS Mouse Cell Type and Tissue Gene Expression Profiles | cell or tissue expression | 15443 | 313 | 76 | 76 | 2 |
| GTEx Tissue Gene Expression Profiles | cell or tissue expression | 26005 | 328 | 31 | 31 | 2 |
| GTEx Tissue Sample Gene Expression Profiles | cell or tissue expression | 19250 | 301 | 2920 | 2920 | 2 |
| HPA Cell Line Gene Expression Profiles | cell or tissue expression | 15868 | 259 | 45 | 45 | 1 |
| HPA Tissue Gene Expression Profiles | cell or tissue expression | 17496 | 314 | 33 | 33 | 2 |
| HPA Tissue Protein Expression Profiles | cell or tissue expression | 15788 | 266 | 46 | 46 | 11 |
| HPA Tissue Sample Gene Expression Profiles | cell or tissue expression | 16742 | 300 | 123 | 123 | 2 |
| HPM Cell Type and Tissue Protein Expression Profiles | cell or tissue expression | 7274 | 94 | 6 | 6 | 2 |
| ProteomicsDB Cell Type and Tissue Protein Expression Profiles | cell or tissue expression | 2776 | 28 | 55 | 55 | 5 |
| Roadmap Epigenomics Cell and Tissue Gene Expression Profiles | cell or tissue expression | 12824 | 164 | 59 | 59 | 6 |
| TISSUES Curated Tissue Protein Expression Evidence Scores | cell or tissue expression | 16216 | 317 | 645 | 245 | 106 |
| TISSUES Experimental Tissue Protein Expression Evidence Scores | cell or tissue expression | 17922 | 316 | 245 | 244 | 44 |
| TISSUES Text-mining Tissue Protein Expression Evidence Scores | cell or tissue expression | 16184 | 330 | 4189 | 2974 | 2118 |
| ENCODE Histone Modification Site Profiles | cell or tissue histone modification sites | 22382 | 330 | 437 | 432 | 91 |
| Roadmap Epigenomics Histone Modification Site Profiles | cell or tissue histone modification sites | 21032 | 313 | 385 | 295 | 282 |
| ENCODE Transcription Factor Binding Site Profiles | cell or tissue transcription factor binding sites | 22845 | 330 | 1681 | 1591 | 723 |
| JASPAR Predicted Transcription Factor Targets | cell or tissue transcription factor binding sites | 21547 | 330 | 113 | 80 | 77 |
| COMPARTMENTS Curated Protein Localization Evidence Scores | cellular compartment associations | 16738 | 330 | 1465 | 228 | 105 |
| COMPARTMENTS Experimental Protein Localization Evidence Scores | cellular compartment associations | 6495 | 73 | 61 | 37 | 10 |
| COMPARTMENTS Text-mining Protein Localization Evidence Scores | cellular compartment associations | 14375 | 330 | 2083 | 877 | 545 |
| GO Cellular Component Annotations | cellular compartment associations | 16757 | 328 | 1549 | 208 | 124 |
| LOCATE Curated Protein Localization Annotations | cellular compartment associations | 9639 | 269 | 80 | 50 | 20 |
| LOCATE Predicted Protein Localization Annotations | cellular compartment associations | 19747 | 325 | 26 | 23 | 10 |
| CTD Gene-Chemical Interactions | chemical interactions | 11125 | 321 | 9518 | 2222 | 2042 |
| Guide to Pharmacology Chemical Ligands of Receptors | chemical interactions | 899 | 209 | 4896 | 189 | 52 |
| Kinativ Kinase Inhibitor Bioactivity Profiles | chemical interactions | 232 | 9 | 28 | 28 | 25 |
| KinomeScan Kinase Inhibitor Targets | chemical interactions | 287 | 10 | 75 | 75 | 72 |
| CMAP Signatures of Differentially Expressed Genes for Small Molecules | chemical perturbation differentially expressed genes | 12148 | 300 | 6102 | 5066 | 5065 |
| ClinVar SNP-Phenotype Associations | disease or phenotype associations | 2458 | 143 | 3293 | 3 | 2 |
| CTD Gene-Disease Associations | disease or phenotype associations | 21582 | 331 | 6327 | 2926 | 2116 |
| dbGAP Gene-Trait Associations | disease or phenotype associations | 5668 | 147 | 512 | 51 | 49 |
| DISEASES Curated Gene-Disease Assocation Evidence Scores | disease or phenotype associations | 2252 | 115 | 772 | 94 | 49 |
| DISEASES Experimental Gene-Disease Assocation Evidence Scores | disease or phenotype associations | 4055 | 131 | 352 | 106 | 43 |
| DISEASES Text-mining Gene-Disease Assocation Evidence Scores | disease or phenotype associations | 15309 | 330 | 4630 | 2559 | 1850 |
| GAD Gene-Disease Associations | disease or phenotype associations | 10705 | 318 | 12780 | 1189 | 980 |
| GAD High Level Gene-Disease Associations | disease or phenotype associations | 8016 | 314 | 20 | 19 | 16 |
| GWAS Catalog Gene-Disease Associations | disease or phenotype associations | 4356 | 127 | 1009 | 30 | 28 |
| GWASdb SNP-Disease Associations | disease or phenotype associations | 11805 | 253 | 587 | 252 | 126 |
| GWASdb SNP-Phenotype Associations | disease or phenotype associations | 12488 | 261 | 824 | 397 | 150 |
| HPO Gene-Disease Associations | disease or phenotype associations | 3158 | 171 | 6844 | 1187 | 667 |
| HuGE Navigator Gene-Phenotype Associations | disease or phenotype associations | 12055 | 322 | 2755 | 1241 | 1153 |
| MPO Gene-Phenotype Associations | disease or phenotype associations | 7798 | 299 | 8581 | 2434 | 1444 |
| OMIM Gene-Disease Associations | disease or phenotype associations | 4553 | 209 | 6177 | 5 | 4 |
| GeneSigDB Published Gene Signatures | gene signatures or modules | 19723 | 331 | 3517 | 1363 | 1313 |
| MSigDB Cancer Gene Co-expression Modules | gene signatures or modules | 4869 | 135 | 358 | 135 | 95 |
| MiRTarBase microRNA Targets | microRNA targets | 12086 | 218 | 598 | 93 | 91 |
| TargetScan Predicted Conserved microRNA Targets | microRNA targets | 14923 | 283 | 1539 | 1020 | 791 |
| TargetScan Predicted Nonconserved microRNA Targets | microRNA targets | 18210 | 324 | 1541 | 1534 | 1236 |
| GO Biological Process Annotations | pathway, function, or process associations | 15717 | 328 | 13214 | 2436 | 1215 |
| GO Molecular Function Annotations | pathway, function, or process associations | 15777 | 327 | 4164 | 367 | 204 |
| HumanCyc Pathways | pathway, function, or process associations | 932 | 41 | 288 | 11 | 8 |
| KEGG Pathways | pathway, function, or process associations | 7016 | 298 | 303 | 185 | 179 |
| PANTHER Pathways | pathway, function, or process associations | 1962 | 138 | 147 | 40 | 39 |
| Reactome Pathways | pathway, function, or process associations | 9005 | 309 | 1814 | 289 | 159 |
| Wikipathways Pathways | pathway, function, or process associations | 4958 | 263 | 301 | 140 | 137 |
| DEPOD Substrates of Phosphatases | phosphatase interactions | 293 | 19 | 114 | 13 | 9 |
| NURSA Protein Complexes | protein complex associations | 9785 | 141 | 1798 | 1182 | 1181 |
| InterPro Predicted Protein Domain Annotations | protein domain associations | 18002 | 329 | 11017 | 119 | 63 |
| BioGRID Protein-Protein Interactions | protein interactions | 15270 | 306 | 15272 | 1191 | 1163 |
| DIP Protein-Protein Interactions | protein interactions | 2709 | 140 | 2711 | 32 | 24 |
| Guide to Pharmacology Protein Ligands of Receptors | protein interactions | 187 | 46 | 213 | 5 | 4 |
| IntAct Biomolecular Interactions | protein interactions | 12303 | 269 | 12305 | 422 | 417 |
| GTEx eQTL | SNP eQTL targets | 7898 | 107 | 7817 | 2 | 1 |
| TOTALS | NA | NA | NA | 174228 | 44092 | 28562 |
Fig 1. Feature selection pipeline.
Each dataset took the form of a matrix with genes labeling the rows and features labeling the columns. We appended the mean and standard deviation computed across all features as two additional features. Step 1: We filtered the columns to eliminate redundant features, replacing each group of correlated features with the group average feature, where a group was defined as features with squared pair-wise correlation coefficient r² ≥ 0.5. If the dataset mean feature was included in a group of correlated features, we replaced the group with the dataset mean. Step 2: We filtered the rows for targets with clinical trial outcomes of interest: targets of selective drugs approved for non-cancer indications (successes) and targets of selective drug candidates that failed in phase III clinical trials for non-cancer indications (failures). Step 3: We tested the significance of each feature as an indicator of success or failure using permutation tests to quantify the significance of the difference between the means of the successful and failed targets. We corrected for multiple hypothesis testing using the Benjamini-Yekutieli method to control the false discovery rate at 0.05 within each dataset. Step 4: We “stressed” the significant features with additional tests to assess their robustness and generalizability. For example, we used bootstrapping to estimate probabilities that the significance findings will replicate on similar sets of targets.
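Step 3 of the pipeline can be illustrated with a minimal sketch, assuming a two-sided permutation test on the absolute difference of group means and the standard Benjamini-Yekutieli step-up adjustment. The function names, the number of permutations, and the add-one p-value estimate are illustrative assumptions, not taken from the study's deposited code.

```python
import random
from statistics import mean


def permutation_pvalue(values, labels, n_perm=2000, seed=0):
    """Two-sided permutation test for the difference between group means.

    values: one feature value per target.
    labels: 1 for a successful target, 0 for a failed target.
    """
    rng = random.Random(seed)

    def mean_diff(lab):
        pos = [v for v, l in zip(values, lab) if l == 1]
        neg = [v for v, l in zip(values, lab) if l == 0]
        return abs(mean(pos) - mean(neg))

    observed = mean_diff(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any link between feature and outcome
        if mean_diff(shuffled) >= observed:
            hits += 1
    # Add-one estimate (an assumption here) keeps the p-value above zero.
    return (hits + 1) / (n_perm + 1)


def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted p-values, in the input order."""
    m = len(pvalues)
    c_m = sum(1.0 / k for k in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted
```

In this sketch, a feature would be called significant within a dataset when its adjusted p-value falls below 0.05.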
Features significantly correlated with phase III outcome.
| Dataset | Feature | Corr Pval | Correlation Sign | Correlated Target Classes (and sign) | Repl Prob (Bootstrap) | Repl Prob (Class Holdout Bootstrap) | Repl Prob (Within Class Permutation Bootstrap) |
|---|---|---|---|---|---|---|---|
| BioGPS Human Cell Type and Tissue Gene Expression Profiles | [mean] | 0.001 | -1 | GPCRs (-1) | 0.89 | 0.98 | 0.83 |
| BioGPS Human Cell Type and Tissue Gene Expression Profiles | stdv | 0.010 | -1 | GPCRs (-1), Integrins (+1) | |||
| BioGPS Mouse Cell Type and Tissue Gene Expression Profiles | [mean] | 0.042 | -1 | GPCRs (-1) | |||
| Allen Brain Atlas Adult Human Brain Tissue Gene Expression Profiles | [mean] | 0.006 | -1 | GPCRs (-1) | 0.80 | ||
| Allen Brain Atlas Adult Mouse Brain Tissue Gene Expression Profiles | r3 roof plate | 0.002 | -1 | None | 0.88 | 1.00 | 0.89 |
| Allen Brain Atlas Adult Mouse Brain Tissue Gene Expression Profiles | [mean] | 0.007 | -1 | None | 1.00 | ||
| GTEx Tissue Gene Expression Profiles | [mean] | 0.014 | -1 | GPCRs (-1) | |||
| GTEx Tissue Gene Expression Profiles | stdv | 0.014 | +1 | GPCRs (+1) | 0.94 | ||
| HPA Tissue Gene Expression Profiles | [mean] | 0.004 | -1 | GPCRs (-1) | 0.80 | 0.90 | 0.85 |
| HPA Tissue Gene Expression Profiles | stdv | 0.004 | +1 | None | 0.81 | 1.00 | 0.81 |
| TISSUES Experimental Tissue Protein Expression Evidence Scores | bone marrow | 0.001 | -1 | GPCRs (-1) | 0.92 | 0.96 | |
| TISSUES Experimental Tissue Protein Expression Evidence Scores | [hematopoietic cells] | 0.001 | -1 | GPCRs (-1), Integrins (+1) | 0.93 | 1.00 | |
| TISSUES Experimental Tissue Protein Expression Evidence Scores | [mean] | 0.001 | -1 | GPCRs (-1) | 0.85 | 0.99 | |
| TISSUES Experimental Tissue Protein Expression Evidence Scores | [epithalamus and pineal gland] | 0.012 | -1 | None | 0.97 | ||
| TISSUES Experimental Tissue Protein Expression Evidence Scores | erythroid cell | 0.015 | -1 | None | 0.94 | ||
| TISSUES Experimental Tissue Protein Expression Evidence Scores | [t-lymphocyte] | 0.017 | -1 | None | 0.95 | ||
| TISSUES Experimental Tissue Protein Expression Evidence Scores | [miscellaneous tissues] | 0.017 | -1 | GPCRs (-1) | |||
| TISSUES Experimental Tissue Protein Expression Evidence Scores | [thymus and thorax] | 0.017 | -1 | Integrins (+1) | |||
| TISSUES Experimental Tissue Protein Expression Evidence Scores | adrenal cortex | 0.043 | -1 | None | | | |
Footnotes
Abbreviations: Corr Pval = p-value corrected for multiple hypothesis testing, Repl Prob = replication probability.
[Square brackets] denote groups of features.
[miscellaneous tissues] is a heterogeneous group of digestive, respiratory, urogenital, reproductive, nervous, cardiovascular, and hematopoietic system tissues.
White background indicates features that passed all tests for robustness and generalizability.
Gray background indicates features that failed at least one test for robustness or generalizability. Strikethrough italics indicates the failed test(s).
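The replication probabilities in the table can be read as the fraction of bootstrap resamples of the targets in which a feature is again found significant. The following is a minimal sketch of that idea, assuming a plain bootstrap over targets and an uncorrected permutation p-value threshold; the function names, resample counts, and the omission of the multiple-testing correction are simplifying assumptions rather than the authors' exact procedure.

```python
import random
from statistics import mean


def perm_pvalue(values, labels, n_perm, rng):
    """Two-sided permutation p-value for the difference in group means."""
    def mean_diff(lab):
        pos = [v for v, l in zip(values, lab) if l == 1]
        neg = [v for v, l in zip(values, lab) if l == 0]
        return abs(mean(pos) - mean(neg))

    observed = mean_diff(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_diff(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def replication_probability(values, labels, n_boot=200, n_perm=200,
                            alpha=0.05, seed=0):
    """Fraction of bootstrap resamples in which the feature is significant."""
    rng = random.Random(seed)
    n = len(values)
    valid = 0
    replicated = 0
    for _ in range(n_boot):
        # Resample targets with replacement, keeping value and label paired.
        idx = [rng.randrange(n) for _ in range(n)]
        v = [values[i] for i in idx]
        l = [labels[i] for i in idx]
        if len(set(l)) < 2:
            continue  # a resample with one outcome class cannot be tested
        valid += 1
        if perm_pvalue(v, l, n_perm, rng) < alpha:
            replicated += 1
    return replicated / valid if valid else float("nan")
```

A class holdout variant could be sketched the same way by dropping all targets of one target class before resampling.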
Fig 2. Modeling pipeline.
We trained a classifier to predict phase III clinical trial outcomes, using 5-fold cross-validation repeated 200 times to assess the stability of the classifier and estimate its generalization performance. For each fold of cross-validation, modeling began with the non-redundant features for each dataset. Step 1: We split the targets with phase III outcomes into training and testing sets. Step 2: We performed univariate feature selection using permutation tests to quantify the significance of the difference between the means of the successful and failed targets in the training examples. We controlled for target class as a confounding factor by only shuffling outcomes within target classes. We accepted features with adjusted p-values less than 0.05 after correcting for multiple hypothesis testing using the Benjamini-Yekutieli method. Step 3: We aggregated significant features from all datasets into a single feature matrix. Step 4: We performed incremental feature elimination with an inner 5-fold cross-validation loop repeated 20 times to select the type of classifier (Random Forest or logistic regression) and smallest subset of features that had cross-validation area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPR) values within 95% of maximum. Step 5: We refit the selected model using all the training examples and evaluated its performance on the test examples.
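The within-class shuffling described in Step 2 can be sketched as follows. This is a minimal illustration, assuming each target carries a single target class label (for example "GPCR"); the function names and the permutation count are hypothetical and not taken from the study's code.

```python
import random
from collections import defaultdict
from statistics import mean


def shuffle_within_classes(labels, classes, rng):
    """Permute outcome labels only among targets of the same target class."""
    groups = defaultdict(list)
    for i, c in enumerate(classes):
        groups[c].append(i)
    shuffled = list(labels)
    for idx in groups.values():
        perm = [labels[i] for i in idx]
        rng.shuffle(perm)
        for i, lab in zip(idx, perm):
            shuffled[i] = lab
    return shuffled


def within_class_pvalue(values, labels, classes, n_perm=2000, seed=0):
    """Permutation p-value for the mean difference, controlling for class."""
    rng = random.Random(seed)

    def mean_diff(lab):
        pos = [v for v, l in zip(values, lab) if l == 1]
        neg = [v for v, l in zip(values, lab) if l == 0]
        return abs(mean(pos) - mean(neg))

    observed = mean_diff(labels)
    hits = 0
    for _ in range(n_perm):
        if mean_diff(shuffle_within_classes(labels, classes, rng)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because the success rate of each class is preserved in every permutation, a feature that merely tracks target class (for instance, one that is low for a class that happens to contain mostly successes) gains no significance from that association in this sketch.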
Distribution of train-test cycles by classifier type and number of selected features.
| Selected Features | Selected Model Type: Logistic Regression | Selected Model Type: Random Forest | Total |
|---|---|---|---|
| 1 | 662 | 5 | 667 |
| 2 | 82 | 84 | 166 |
| 3 | 57 | 41 | 98 |
| 4 | 22 | 2 | 24 |
| 5 | 24 | 1 | 25 |
| 6 | 11 | 0 | 11 |
| 7 | 6 | 0 | 6 |
| 8 | 2 | 0 | 2 |
| Total | 866 | 133 | 999 |
Footnotes
* 1 train-test cycle yielded no significant features for modeling
Number of train-test cycles in which feature was selected for the classifier.
| Feature Type | Feature | Count |
|---|---|---|
| cell or tissue expression | mean across tissues | 685 |
| cell or tissue expression | standard deviation across tissues | 585 |
| cell or tissue expression | other | 214 |
| disease or phenotype associations | mean across diseases | 2 |
| disease or phenotype associations | other | 2 |
| pathway, function, or process associations | any | 1 |
Fig 3. Classifier performance.
(A) Receiver operating characteristic (ROC) curve. The solid black line indicates the median performance across 200 repetitions of 5-fold cross-validation and the gray area indicates the range between the 2.5 and 97.5 percentiles. The dotted black line indicates the performance of random rankings. (B) Distributions of the probability of success predicted by the classifier for the successful, failed, and unlabeled targets. (C) Precision-recall curve for success predictions. (D) Precision-recall curve for failure predictions. (E) Pairwise target comparisons. For each pair of targets, we computed the fraction of repetitions of cross-validation in which Target B had a higher predicted probability of success than Target A. The heatmap illustrates this fraction, thresholded at 0.95 or 0.99, plotted as a function of the median predicted probabilities of success of the two targets. The upper left region is where the classifier is 95% (above solid black line) or 99% (above dotted blue line) consistent in predicting greater probability of success of Target B than Target A. (F) Relationship between features and phase III outcomes. Heatmap showing the projection of the predicted success probabilities onto the two dominant features selected for the classifier: mean expression across tissues and standard deviation of expression across tissues. Red, white, and blue background colors correspond to success probabilities of 1, 0.5, and 0. Red plusses and blue crosses mark the locations of the success and failure examples. It appears the model has learned that failures tend to have high mean expression and low standard deviation of expression across tissues, while successes tend to have low mean expression and high standard deviation of expression. The success and failure examples are not well separated, indicating that we did not discover enough features to fully explain why targets succeed or fail in phase III clinical trials.
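The pairwise comparison in panel (E) reduces to counting, for a pair of targets, how often one is ranked above the other across repetitions. A minimal sketch, assuming the predictions are stored as one list of probabilities per repetition (the data layout and function name are hypothetical):

```python
def pairwise_consistency(predictions, a, b):
    """Fraction of repetitions in which target b outscores target a.

    predictions: list of repetitions; each repetition is a list holding the
    predicted probability of success for every target, in a fixed order.
    """
    wins = sum(1 for rep in predictions if rep[b] > rep[a])
    return wins / len(predictions)


# Four hypothetical repetitions for two targets (index 0 = A, index 1 = B).
example = [[0.2, 0.8], [0.3, 0.7], [0.6, 0.5], [0.1, 0.9]]
fraction = pairwise_consistency(example, a=0, b=1)  # 3 of 4 repetitions
```

Under the thresholds named in the caption, a pair would count as consistently ordered when this fraction reaches 0.95 or 0.99.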
Classifier performance statistics.
| Statistic | 2.5 Percentile | Median | 97.5 Percentile |
|---|---|---|---|
| True Positives (TP) | 91 | 220 | 243 |
| False Positives (FP) | 16 | 52 | 65 |
| True Negatives (TN) | 5 | 16 | 52 |
| False Negatives (FN) | 1 | 24 | 154 |
| True Positive Rate (TPR) | 0.370 | 0.903 | 0.995 |
| False Positive Rate (FPR) | 0.232 | 0.762 | 0.928 |
| False Negative Rate (FNR) | 0.005 | 0.096 | 0.630 |
| True Negative Rate (TNR) | 0.072 | 0.237 | 0.768 |
| Misclassification Rate (MCR) | 0.206 | 0.241 | 0.542 |
| Accuracy (ACC) | 0.458 | 0.759 | 0.794 |
| False Discovery Rate (FDR) | 0.149 | 0.194 | 0.213 |
| Positive Predictive Value (PPV) | 0.787 | 0.806 | 0.851 |
| False Omission Rate (FOMR) | 0.233 | 0.583 | 0.741 |
| Negative Predictive Value (NPV) | 0.259 | 0.417 | 0.767 |
| Area Under Receiver Operating Characteristic Curve (AUROC) | 0.512 | 0.574 | 0.615 |
| Area Under Precision-Recall Curve (AUPR) | 0.777 | 0.811 | 0.836 |
| Positive Likelihood Ratio (PLR) | 1.058 | 1.184 | 1.619 |
| Negative Likelihood Ratio (NLR) | 0.086 | 0.402 | 0.819 |
| Diagnostic Odds Ratio (DOR) | 1.748 | 3.066 | 13.344 |
| Risk Ratio (RR) | 1.143 | 1.387 | 3.447 |
| Matthews Correlation Coefficient (MCC) | 0.100 | 0.178 | 0.251 |
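Most statistics in the table follow from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the risk ratio is assumed here to be PPV divided by FOMR, since the table does not state its definition. Because each row reports percentiles taken separately across repetitions, plugging the median counts into these formulas need not reproduce the median rates exactly.

```python
from math import sqrt


def classifier_stats(tp, fp, tn, fn):
    """Derive common performance statistics from confusion-matrix counts.

    Assumes every denominator is nonzero.
    """
    tpr = tp / (tp + fn)           # sensitivity / recall
    fpr = fp / (fp + tn)
    fnr = fn / (tp + fn)
    tnr = tn / (fp + tn)           # specificity
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)
    fomr = fn / (tn + fn)          # false omission rate
    plr = tpr / fpr
    nlr = fnr / tnr
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "TPR": tpr, "FPR": fpr, "FNR": fnr, "TNR": tnr,
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "PPV": ppv, "FDR": 1 - ppv, "NPV": npv, "FOMR": fomr,
        "PLR": plr, "NLR": nlr, "DOR": plr / nlr,
        "RR": ppv / fomr,          # assumed definition, see lead-in
        "MCC": mcc,
    }


# Hypothetical counts, not taken from the table.
stats = classifier_stats(tp=80, fp=20, tn=30, fn=10)
```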
Examples of successful tissue-specific targets.
| Target | Indication | Expression | Outcome |
|---|---|---|---|
| PNLIP | Pancreatic insufficiency | Pancreas 5-fold higher than other tissues | Success |
| MMP8 | Osteoarthritis | Bone marrow 4-fold higher than other tissues | Success |
| ATP4A | Ulcer, gastro-esophageal reflux | Stomach 3-fold higher than other tissues | Success |
| GABRA1 | Neurological diseases (anxiety, depression, addiction, pain, insomnia, epilepsy) | Brain 3-fold higher than other tissues | Success |
Examples of failed tissue-specific targets with plausible exceptions.
| Target | Indication | Expression | Outcome | Exception |
|---|---|---|---|---|
| BPI | Bacterial infections | Bone marrow 3-fold higher than other tissues | Failure | The drug is recombinant BPI, which is used for its anti-bacterial properties, thus modulation of endogenous BPI is not directly relevant to efficacy of the therapy |
| TSHR | Goiter | Thyroid 4-fold higher than other tissues | Failure | The trial was canceled before enrollment, thus perhaps TSHR should not be counted as a phase III failure |
Examples of failed ubiquitously expressed targets.
| Target | Indication | Expression | Outcome |
|---|---|---|---|
| DPP8 | Heart failure | Ubiquitous | Failure |
| CSNK2B | Human papilloma virus infection | Ubiquitous | Failure |
Examples of successful ubiquitously expressed targets with plausible exceptions.
| Target | Indication | Expression | Outcome | Exception |
|---|---|---|---|---|
| MTOR | Restenosis | Ubiquitous | Success | Tissue specificity is achieved via the delivery method (drug-eluting stent) |
| IFNAR1 | Eye infections | Ubiquitous | Success | Tissue specificity is achieved via the delivery method (eye drops) |
| GBA | Gaucher’s disease | Ubiquitous | Success | Gaucher’s disease is a loss-of-function genetic disorder affecting multiple organ systems, thus therapy requires system-wide replacement of the defective enzyme |