| Literature DB >> 32533004 |
Laura-Jayne Gardiner, Anna Paola Carrieri, Jenny Wilshaw, Stephen Checkley, Edward O Pyzer-Knapp, Ritesh Krishna.
Abstract
During the development of new drugs or compounds, preclinical trials, commonly involving animal tests, are required to ascertain the safety of a compound prior to human trials. Machine learning techniques could provide an in-silico alternative to animal models for assessing drug toxicity, reducing expensive and invasive animal testing for the drugs that are most likely to fail safety tests. Here we present a machine learning model to predict kidney dysfunction, as a proxy for drug-induced renal toxicity, in rats. To achieve this, we train our models on inexpensive transcriptomic profiles derived from human cell lines after chemical compound treatment, combined with compound chemical structure information. Genomics data, due to its sparse, high-dimensional and noisy nature, presents significant challenges in building trustworthy and transparent machine learning models. We address these issues by judiciously building feature sets from heterogeneous sources and coupling them with measures of model uncertainty obtained through Gaussian Process based Bayesian models. We combine insight into the feature-wise contributions to our predictions with the predictive uncertainties recovered from the Gaussian Process to improve the transparency and trustworthiness of the model.
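The central modelling choice described in the abstract is a Gaussian Process regressor whose predictive distribution yields a per-point standard deviation, used as an uncertainty estimate. A minimal sketch of that idea, using scikit-learn's `GaussianProcessRegressor` on synthetic stand-in data (the feature matrix, target, and kernel choice here are illustrative assumptions, not the paper's actual pipeline):

```python
# Sketch: GP regression with per-point predictive uncertainty.
# X and y are synthetic stand-ins for the paper's features/targets.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))                        # toy feature matrix
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=200)   # toy continuous target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# ARD-style kernel: one length scale per feature, plus an observation-noise term
kernel = RBF(length_scale=np.ones(X.shape[1])) + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gp.fit(X_tr, y_tr)

# return_std=True recovers the standard deviation of the predictive
# distribution for each test point -- the model's uncertainty estimate
mean, std = gp.predict(X_te, return_std=True)
print(mean.shape, std.shape, float(std.mean()))
```

Points with a large predictive standard deviation are the ones the abstract suggests treating as less trustworthy.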
Year: 2020 PMID: 32533004 PMCID: PMC7293302 DOI: 10.1038/s41598-020-66481-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1 True versus predicted test values from BUN level prediction. Scatter plots showing the true (x-axis) versus predicted (y-axis) test values using: (a) gene expression data as training data, (b) chemical structure information as training data and (c) both gene expression and chemical structure as training data. Scatter plots showing the true (x-axis) versus predicted (y-axis) test values using gene expression and chemical structure as training data after: (d) PCA using 57 components, (e) tSVD using 57 components and (f) hierarchical tSVD (tSVDh). All scatterplots show marginal histograms with regression and kernel density fits. The 95% confidence interval for the regression estimate is drawn using translucent bands around the regression line. Datapoint colour is according to the standard deviation of the predictive distribution per point; colour scales vary between plots. Panel (f) shows our best model.
ML analyses to predict BUN levels in rats.
| Training dataset | L1000 gene expression | Chemical structure | L1000 gene expression plus chemical structure | L1000 gene expression plus chemical structure | L1000 gene expression plus chemical structure | L1000 gene expression plus chemical structure | L1000 gene expression plus chemical structure |
|---|---|---|---|---|---|---|---|
| Regressor | Gaussian Process | Gaussian Process | Gaussian Process | Gaussian Process + PCA 57 | Gaussian Process + tSVD 57 | **Gaussian Process + tSVD hierarchical** | LightGBM |
| Test set MAE score using best parameters | 2.156 | 2.156 | 2.365 | 2.700 | 2.452 | **1.544** | |
| Test set RMSE score using best parameters (**weighted RMSE) | 3.708 **3.708 | 3.708 **3.708 | 2.961 **2.240 | 3.359 **2.157 | 3.176 **2.047 | **2.146** | |
| Test set r2 score using best parameters | 0.013 | 0.028 | 0.418 | 0.464 | 0.520 | **0.656** | |
| Average SD of the predictive distributions (ASD) | 3.283 | 3.283 | 0.683 | 0.427 | 0.425 | | — |
Showing the results from the analysis comparing the effect of combining chemical and gene expression data, the effect of using dimensionality reduction and, finally, the best alternative regressor of the 8 tested, defined as the one producing the highest test prediction accuracy (lowest MAE balanced with highest r2 score). The best model overall is highlighted in bold. **Weighted RMSE.
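The table compares flat dimensionality reduction (PCA and truncated SVD, both to 57 components) against a "hierarchical" tSVD variant. A hedged sketch of these steps with scikit-learn, on mock data of the paper's stated dimensions (964 landmark genes + 166 MACCS bits); the hierarchical variant shown last is an assumption (reducing each data source separately, then concatenating) and the per-source component counts are illustrative, not the paper's exact procedure:

```python
# Sketch of the dimensionality-reduction variants compared in the table.
# The feature matrices are random mock data with the stated dimensions.
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.RandomState(0)
n_genes, n_bits = 964, 166                    # L1000 landmark genes + MACCS bits
X_genes = rng.normal(size=(120, n_genes))
X_chem = rng.randint(0, 2, size=(120, n_bits)).astype(float)
X = np.hstack([X_genes, X_chem])              # combined feature matrix

# Flat reduction of the combined matrix ("PCA 57" / "tSVD 57" in the table)
X_pca = PCA(n_components=57, random_state=0).fit_transform(X)
X_tsvd = TruncatedSVD(n_components=57, random_state=0).fit_transform(X)

# Hypothetical "hierarchical" variant: reduce each source on its own,
# then stack the reduced blocks (component split 40/17 is arbitrary here)
g_red = TruncatedSVD(n_components=40, random_state=0).fit_transform(X_genes)
c_red = TruncatedSVD(n_components=17, random_state=0).fit_transform(X_chem)
X_tsvdh = np.hstack([g_red, c_red])

print(X_pca.shape, X_tsvd.shape, X_tsvdh.shape)
```

Reducing each source separately prevents the dense, high-variance gene expression block from dominating the binary fingerprint block in a single joint decomposition, which is one plausible motivation for a hierarchical scheme.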
Figure 2 Results from BUN level prediction ML comparative regression analysis. Bar charts showing the RMSE and MAE scores for the test datasets (using best parameters) and the mean MAE training scores after 10-fold cross validation, with the standard deviation shown as error bars. The left y-axis is used for MAE/RMSE scores. Line plots show the R2 score for test datasets (using best parameters), with the right y-axis used for R2 scores. For Linear Regression, SVM, KNN, XGBoost and GP, tSVDh dimensionality reduction was used, whereas for the other approaches it was not, since the resulting models were more predictive without it.
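The MAE, RMSE and R2 scores reported in the table and Figure 2 are standard regression metrics; for clarity, here is how they are computed with scikit-learn on small illustrative arrays (not the paper's data):

```python
# The three regression metrics reported: MAE, RMSE, and R^2.
# y_true/y_pred are illustrative values, not the paper's predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10.0, 12.5, 9.0, 14.0, 11.0])
y_pred = np.array([10.5, 12.0, 9.5, 13.0, 11.5])

mae = mean_absolute_error(y_true, y_pred)        # mean |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
r2 = r2_score(y_true, y_pred)                    # 1 - SS_res / SS_tot

print(round(mae, 3), round(rmse, 3), round(r2, 3))  # 0.6 0.632 0.873
```

Lower MAE/RMSE and higher R2 are better, which is why the table's best model is selected on "lowest MAE balanced with highest r2 score". The table's "weighted RMSE" variant is not defined in this excerpt, so it is not reproduced here.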
Figure 3 Investigation into feature importance for the ML model. (a) Results from the ExtraTreesRegressor approach, trained on L1000 gene expression data and chemical structure information, showing feature importance as a bar chart for the 964 landmark genes and 166-bit MACCS chemical fingerprints, alongside a cumulative line plot of feature importance. (b) The highest feature contributions (>0.5) for the gene-expression-based dimension with the shortest length scale in our best GP model.
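The feature-importance analysis in Figure 3a can be sketched as follows: an `ExtraTreesRegressor` is fitted to the combined feature matrix and its `feature_importances_` are ranked, with a cumulative sum giving the line plot. The data, target, and hyperparameters below are illustrative stand-ins, not the paper's configuration:

```python
# Sketch of the Figure 3a analysis: tree-ensemble feature importance over
# 964 gene-expression features + 166 MACCS fingerprint bits (mock data).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.RandomState(0)
n_genes, n_bits = 964, 166
X = rng.normal(size=(150, n_genes + n_bits))
y = X[:, 0] * 3.0 + X[:, 1] + rng.normal(scale=0.1, size=150)  # toy target

model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

importances = model.feature_importances_       # normalised to sum to 1
order = np.argsort(importances)[::-1]          # bar chart: ranked features
cumulative = np.cumsum(importances[order])     # cumulative line plot
print(order[:5], float(cumulative[-1]))
```

A steep cumulative curve indicates that a small subset of genes and fingerprint bits carries most of the predictive signal, which is the kind of transparency the abstract argues for.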