Daniel M Bean, Honghan Wu, Ehtesham Iqbal, Olubanke Dzahini, Zina M Ibrahim, Matthew Broadbent, Robert Stewart, Richard J B Dobson.
Abstract
Unknown adverse reactions to drugs available on the market present a significant health risk and limit accurate judgement of the cost/benefit trade-off for medications. Machine learning has the potential to predict unknown adverse reactions from current knowledge. We constructed a knowledge graph containing four types of node: drugs, protein targets, indications and adverse reactions. Using this graph, we developed a machine learning algorithm based on a simple enrichment test and first demonstrated that this method performs extremely well at classifying known causes of adverse reactions (AUC 0.92). A cross-validation scheme in which 10% of drug-adverse reaction edges were systematically deleted per fold showed that the method correctly predicts 68% of the deleted edges on average. Next, a subset of adverse reactions that could be reliably detected in anonymised electronic health records from South London and Maudsley NHS Foundation Trust was used to validate predictions from the model that are not currently known in public databases. High-confidence predictions were validated in electronic records significantly more frequently than predictions from random models, and outperformed standard methods (logistic regression, decision trees and support vector machines). This approach has the potential to improve patient safety by predicting adverse reactions that were not observed during randomised trials.
Year: 2017 PMID: 29180758 PMCID: PMC5703951 DOI: 10.1038/s41598-017-16674-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Overview of the prediction algorithm. (a) Starting from a knowledge graph containing all publicly available information on the ADR being predicted, an enrichment test is used to identify predictive features of the drugs known to cause the ADR. The total adjacency of every drug with all predictors of each type (the columns of the matrix) is calculated from the graph. Blue nodes are drugs, red nodes are ADRs, orange nodes are targets, green nodes are indications. (b) The features (adjacency matrix from (a)) are scaled and weighted to produce a final score for every drug. (c) The optimum weight vector from (b) is learned from the knowledge graph to maximize an objective function. The predictions of this optimized model are tested in EHRs.
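
The steps in panels (a) and (b) can be illustrated with a minimal sketch. The caption does not say which enrichment test is used, so a one-sided hypergeometric (Fisher-style) test is assumed here. Scaling by the column maximum, the 0.05 threshold, the hand-set weights, and all drug and feature labels (`A`..`J`, `T1`, `I1`, ...) are hypothetical choices for illustration, not values from the paper. The weight learning in panel (c) is not shown.

```python
from math import comb

def enrichment_p(k, K, n, N):
    """One-sided hypergeometric tail P(X >= k).
    N drugs in total, K carry the feature, n are known causes of the ADR,
    k of those known causes carry the feature."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def enriched_features(drug_features, positives, alpha=0.05):
    """Return the features over-represented among the known causes."""
    N, n = len(drug_features), len(positives)
    selected = set()
    for feature in set().union(*drug_features.values()):
        K = sum(feature in feats for feats in drug_features.values())
        k = sum(feature in drug_features[d] for d in positives)
        if enrichment_p(k, K, n, N) < alpha:
            selected.add(feature)
    return selected

def score_drugs(adjacency, weights):
    """Scale each feature-type column by its maximum, then take a weighted sum."""
    scores = {}
    for ftype, counts in adjacency.items():
        top = max(counts.values()) or 1  # avoid dividing by zero
        for drug, count in counts.items():
            scores[drug] = scores.get(drug, 0.0) + weights[ftype] * count / top
    return scores

# Hypothetical toy graph: ten drugs, each with a set of targets and indications.
targets = {
    "A": {"T1", "T2"}, "B": {"T1"}, "C": {"T1"}, "D": {"T1"},
    "E": {"T3"}, "F": {"T3"}, "G": {"T2"}, "H": {"T4"}, "I": {"T4"}, "J": {"T5"},
}
indications = {
    "A": {"I1"}, "B": {"I1"}, "C": {"I1"}, "D": {"I1"},
    "E": {"I2"}, "F": {"I2"}, "G": {"I3"}, "H": {"I2"}, "I": {"I2"}, "J": {"I4"},
}
known_causes = {"A", "B", "C"}  # drugs with an edge to the ADR

selected_targets = enriched_features(targets, known_causes)
selected_indications = enriched_features(indications, known_causes)

# Adjacency of every drug with the selected predictors of each type.
adjacency = {
    "target": {d: len(f & selected_targets) for d, f in targets.items()},
    "indication": {d: len(f & selected_indications) for d, f in indications.items()},
}
scores = score_drugs(adjacency, {"target": 0.5, "indication": 0.5})

# Rank drugs that are not yet known causes; the top ones are the new predictions.
candidates = sorted((d for d in scores if d not in known_causes),
                    key=scores.get, reverse=True)
```

In this toy graph, drug `D` shares both the enriched target and the enriched indication with the known causes, so it ranks first among the candidates.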
Size of the drug knowledge graph. Raw data was filtered to retain only marketed drugs with at least one known ADR, target and indication.
| Network | Drug nodes | Other nodes | Total nodes | Edges |
|---|---|---|---|---|
| Drug-ADR | 524 | 3144 | 3668 | 62380 |
| Drug-Target | 524 | 736 | 1260 | 2610 |
| Drug-Indication | 524 | 1424 | 1948 | 5392 |
| Total | 524 | 5304 | 5828 | 70382 |
Figure 2. Trained models outperform random and standard models in simulated prediction tasks. (a) Distribution of the proportion of deleted edges that was correctly predicted by each method for each ADR, as an average over all folds. (b) Average proportion of deleted edges correctly predicted by each algorithm for all ADRs. (c) Proportion of deleted edges predicted by trained models compared to the expected proportion achieved by a random model. Solid diagonal line represents identical performance. Points above the line indicate the trained model performed better than random. DT = Decision Trees, LR = Logistic Regression, SVM = Support Vector Machines.
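
The edge-deletion evaluation summarised in this figure can be sketched as follows, for a single ADR. This is an illustration under stated assumptions, not the authors' code: `edge_deletion_cv`, the `fit_and_predict` callback, the toy `drug_class` lookup and the fold assignment by shuffling are all hypothetical.

```python
import random

def edge_deletion_cv(positives, fit_and_predict, n_folds=10, seed=0):
    """Delete one fold of known drug-ADR edges at a time, retrain on the rest,
    and return the mean proportion of deleted edges that is predicted again."""
    drugs = sorted(positives)
    random.Random(seed).shuffle(drugs)
    folds = [drugs[i::n_folds] for i in range(n_folds)]  # roughly 10% each
    recovered = []
    for held_out in folds:
        if not held_out:
            continue  # fewer known causes than folds
        train = set(drugs) - set(held_out)
        predicted = fit_and_predict(train)  # drugs predicted to cause the ADR
        hits = sum(1 for d in held_out if d in predicted)
        recovered.append(hits / len(held_out))
    return sum(recovered) / len(recovered)

# Hypothetical stand-in model: predict every drug that shares a class
# with at least one training positive.
drug_class = {f"d{i}": ("class_x" if i < 10 else "class_y") for i in range(20)}
known_causes = {f"d{i}" for i in range(10)}

def same_class_model(train):
    classes = {drug_class[d] for d in train}
    return {d for d, c in drug_class.items() if c in classes and d not in train}

mean_recovered = edge_deletion_cv(known_causes, same_class_model)
never_predicts = edge_deletion_cv(known_causes, lambda train: set())
```

Passing the model in as a callback keeps the fold logic independent of the scoring method, so the same loop could be reused for the comparison methods named in the caption.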
ADRs for which we attempted to validate novel predicted drug associations in the EHR. The “known” column refers to the total number of drugs in the knowledge graph with an edge to each ADR.
| Name | UMLS | Known | AUC | High confidence |
|---|---|---|---|---|
| Akathisia | C0392156 | 46 | 0.951 | True |
| Alopecia | C0002170 | 215 | 0.857 | False |
| Amenorrhoea | C0002453 | 62 | 0.877 | True |
| Galactorrhoea | C0235660 | 49 | 0.887 | True |
| Hyperprolactinaemia | C0020514 | 14 | 0.973 | True |
| Hypersalivation | C0013132 | 12 | 0.982 | True |
| Neuroleptic Malignant Syndrome | C0027849 | 43 | 0.965 | True |
| Pericarditis | C0031046 | 28 | 0.895 | False |
| Pulmonary embolism | C0034065 | 67 | 0.898 | False |
| Stevens-Johnson syndrome | C0038325 | 165 | 0.842 | False |
Validation of trained models in EHR data. N = number of drugs predicted to cause the ADR that were tested in the EHR data. V = number of predicted drugs that were associated with the ADR (validated) in EHR data. E = expected number of validated predictions given N and the proportion of all drugs that are associated in the EHR. Random models generate N predictions for each ADR, and the trained model is considered significant if <5% of 100,000 random models had an equal or greater validation rate.
| Name | N | V | E | Proportion random ≥ trained | Significant | High confidence |
|---|---|---|---|---|---|---|
| Akathisia | 22 | 9 | 5 | 0.0337 | True | True |
| Alopecia | 18 | 5 | 4 | 0.3556 | False | False |
| Amenorrhoea | 20 | 6 | 3 | 0.0386 | True | True |
| Galactorrhoea | 22 | 5 | 2 | 0.0329 | True | True |
| Hyperprolactinaemia | 14 | 12 | 6 | 6.00E-04 | True | True |
| Hypersalivation | 20 | 15 | 9 | 8.00E-04 | True | True |
| Neuroleptic Malignant Syndrome | 20 | 6 | 3 | 0.0612 | False | True |
| Pericarditis | 19 | 11 | 3 | 1.00E-04 | True | False |
| Pulmonary embolism | 22 | 8 | 5 | 0.1015 | False | False |
| Stevens-Johnson syndrome | 22 | 4 | 2 | 0.1434 | False | False |
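
The random-model comparison described in the caption of this table can be sketched as a simple resampling test. The pool of candidate drugs and their EHR association flags below are hypothetical, as is the function name. The caption specifies 100,000 random models, while the demo uses fewer to keep it fast.

```python
import random

def random_model_test(n_pred, n_validated, ehr_flags, n_random=100_000, seed=0):
    """Compare a trained model's validated count with random models.
    ehr_flags holds one boolean per candidate drug: True if the drug is
    associated with the ADR in the EHR. Returns (E, proportion), where E is
    the expected number of validated predictions and proportion is the
    fraction of random models that validate at least as many as the trained one."""
    rng = random.Random(seed)
    expected = n_pred * sum(ehr_flags) / len(ehr_flags)
    at_least = 0
    for _ in range(n_random):
        sample = rng.sample(ehr_flags, n_pred)  # a random model's N predictions
        if sum(sample) >= n_validated:
            at_least += 1
    return expected, at_least / n_random

# Hypothetical pool: 100 candidate drugs, 25 of them associated in the EHR.
ehr_flags = [True] * 25 + [False] * 75

e_strong, p_strong = random_model_test(20, 12, ehr_flags, n_random=20_000)
e_weak, p_weak = random_model_test(20, 5, ehr_flags, n_random=20_000)

# Under the rule in the caption, a model is significant when fewer than 5%
# of random models do at least as well.
strong_significant = p_strong < 0.05
weak_significant = p_weak < 0.05
```

Because every random model makes the same number of predictions N, comparing validated counts is equivalent to comparing validation rates.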
Prediction performance compared to other methods. By definition the method developed in this paper makes predictions for all 10 of the validation ADRs. The average percent of random models with better performance is calculated considering only the ADRs with at least one validated prediction. LR = logistic regression, DT = decision trees, SVM = support vector machines.
| Method | Percent of all ADRs with new predictions | Validation ADRs with new predictions | ADRs with ≥1 predictions validated | Trained model outperforms >95% of random models | Average % of random models with better performance |
|---|---|---|---|---|---|
| This paper | 91.1 | 10/10 | 10/10 | 6/10 (60%) | 7.7 |
| LR | 52.8 | 7/10 | 6/7 | 1/7 (14%) | 24.6 |
| DT | 89.7 | 10/10 | 4/10 | 1/10 (10%) | 24.8 |
| SVM | 46.7 | 4/10 | 4/4 | 1/4 (25%) | 36.7 |
The ten highest-scoring predicted ADRs that were not present in the drug knowledge graph and were validated in EHRs. The number of reports of each drug-ADR pair (“Drug + ADR”) and the total number of reports of all ADRs for each drug (“Drug (all)”) are shown for both the EHR used for validation and the EudraVigilance database. The total ADR reports for each drug in the EHR only includes the 10 ADRs used for validation. The EudraVigilance reports include all cases for all ADRs reported in the dataset up to August 2017 (accessed October 2017). Note that the ratio of “Drug + ADR” to “Drug (all)” is expected to be much larger in the EHR as only 10 ADRs are considered, vs all ADRs for EudraVigilance.
| Drug | ADR | EHR: Drug + ADR | EHR: Drug (all) | EudraVigilance: Drug + ADR | EudraVigilance: Drug (all) |
|---|---|---|---|---|---|
| Imipramine | Akathisia | 2 | 4 | 5 | 1,465 |
| Trimipramine | Akathisia | 1 | 2 | 3 | 931 |
| Amitriptyline | Akathisia | 2 | 13 | 16 | 8,832 |
| Quetiapine | Alopecia | 18 | 20 | 74 | 34,010 |
| Mirtazapine | Neuroleptic Malignant Syndrome | 2 | 81 | 81 | 10,215 |
| Clomipramine | Pulmonary Embolism | 1 | 8 | 15 | 3,676 |
| Lamotrigine | Pulmonary Embolism | 8 | 63 | 29 | 21,168 |
| Donepezil | Pulmonary Embolism | 7 | 12 | 15 | 5,129 |
| Haloperidol | Pulmonary Embolism | 6 | 53 | 99 | 9,532 |
| Aripiprazole | Stevens-Johnson Syndrome | 3 | 564 | 29 | 17,758 |