Sukyung Seo1, Taekeon Lee1, Mi-Hyun Kim2, Youngmi Yoon1.
Abstract
Identifying the potential side effects of drugs is crucial in clinical trials in the pharmaceutical industry. Existing side effect prediction methods mainly focus on the chemical and biological properties of drugs. This study proposes a method that uses diverse information, such as drug-drug interactions from DrugBank, drug-drug interactions from a network, single nucleotide polymorphisms, and the side effect anatomical hierarchy, in addition to chemical structures, indications, and targets. The proposed method is based on the assumption that properties used in drug repositioning studies can also be used to predict side effects, because the phenotypic expression of a side effect is similar to that of a disease. The prediction results using the proposed method showed a 3.5% improvement in the area under the curve (AUC) over that obtained when only chemical, indication, and target features were used. The random forest model delivered outstanding results for all combinations of feature types. Finally, after identifying candidate side effects of drugs using the proposed method, the following four popular drugs were discussed: (1) dasatinib, (2) sitagliptin, (3) vorinostat, and (4) clonidine.
Year: 2020 PMID: 32190647 PMCID: PMC7064827 DOI: 10.1155/2020/1357630
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. System overview. (a) To build a set of features for a drug-side effect pair, the maximum similarity was selected for each feature based on the known associations from the training samples between the side effect and other drugs. (b) At this step, the maximum side effect anatomical hierarchy similarity was chosen based on the known side effects of the drug from the training samples. (c) By assigning values, as was done for (a) and (b), datasets for machine learning were created with different combinations of features, and diverse classification algorithms, including a random forest, XGBoost, logistic regression, and naive Bayesian model, were applied to predict the relationship between a side effect and a drug. (d) Stacking ensemble learning that incorporated all four classifiers as its base classifiers and used a neural network as its meta classifier was applied.
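Steps (a) and (b) of the overview can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each drug and each side effect is described by a set of terms and that similarity is Jaccard similarity. The function names, the toy data, and the choice of Jaccard are all assumptions introduced here.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_drug_similarity(drug, side_effect, train_pairs, profiles):
    """Step (a): maximum similarity between `drug` and any other drug
    already linked to `side_effect` in the training pairs."""
    others = {d for d, s in train_pairs if s == side_effect and d != drug}
    return max((jaccard(profiles[drug], profiles[d]) for d in others), default=0.0)


def max_side_effect_similarity(drug, side_effect, train_pairs, se_profiles):
    """Step (b): maximum similarity between `side_effect` and any other
    side effect already known for `drug` in the training pairs."""
    known = {s for d, s in train_pairs if d == drug and s != side_effect}
    return max(
        (jaccard(se_profiles[side_effect], se_profiles[s]) for s in known),
        default=0.0,
    )


# Hypothetical toy data: drug -> target set, side effect -> anatomical terms.
targets = {"drugA": {"T1", "T2"}, "drugB": {"T2", "T3"}, "drugC": {"T9"}}
anatomy = {"nausea": {"GI"}, "vomiting": {"GI", "CNS"}, "rash": {"skin"}}
train_pairs = [("drugB", "nausea"), ("drugC", "nausea"), ("drugA", "vomiting")]

feature_a = max_drug_similarity("drugA", "nausea", train_pairs, targets)
feature_b = max_side_effect_similarity("drugA", "nausea", train_pairs, anatomy)
print(feature_a, feature_b)
```

Taking the maximum rather than the mean means a single closely related drug (or side effect) is enough to give the pair a high feature value; pairs with no known neighbours fall back to 0.0 in this sketch.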
Information on features used in this study.
| Name | Description | Source |
|---|---|---|
| Drug-drug interactions from DrugBank (DDIs-D) | Change in the efficacy or toxicity of one drug caused by another drug | DrugBank |
| Drug-drug interactions from network (DDIs-N) | | STRING |
| Single nucleotide polymorphism (SNP) | Substitution of a single nucleotide that occurs at a specific position in the genome | Fagny et al. |
| Target protein | Functional biomolecule controlled by drugs | DrugBank |
| Chemical structure | A molecule represented by a graph with nodes (atoms) and edges (bonds) | DrugBank, PubChem |
| Indication | Use of drugs for treating particular diseases | TTD, CTD, repoDB |
| Side effect anatomical hierarchy | Anatomical characteristic of a side effect | Wadhaw et al. |
Figure 2. Schema of side effect anatomical hierarchy.
Figure 3. Assigning values to drug-side effect pairs.
Figure 4. Base classifiers and stacking ensemble learning.
Contingency table.
| Predicted / Actual | True | False | Row total |
|---|---|---|---|
| True | | | |
| False | | | |
| Column total | | | |
Number of pairs that are used in each set.
| | Number of pairs in the positive set | Number of pairs in the negative set | Total number of pairs used in the set |
|---|---|---|---|
| Training set | 61,316 | 61,316 | 122,632 |
| Validation set | 7,664 | 7,664 | 15,328 |
| Test set | 7,665 | 7,665 | 15,330 |
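The positive-set sizes in the table above (61,316 + 7,664 + 7,665 = 76,645) are consistent with an 8:1:1 split. The ratio is inferred from the counts rather than stated in this excerpt, so the following is only a sketch of how such a split could be produced; `split_8_1_1` and the seed are hypothetical.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle and split into train/validation/test at 80%/10%/10%,
    using integer arithmetic to avoid floating-point rounding."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    train_end = n * 8 // 10
    val_end = n * 9 // 10
    return items[:train_end], items[train_end:val_end], items[val_end:]


# Stand-in identifiers for the 76,645 positive pairs.
positives = range(76_645)
train, val, test = split_8_1_1(positives)
print(len(train), len(val), len(test))  # prints: 61316 7664 7665
```

Applying the same function to an equally sized negative set would reproduce the balanced totals of 122,632, 15,328, and 15,330 pairs.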
Averaged AUCs from our dataset for 100 hold-out validation runs of our machine learning algorithms.
| Type of feature set | Validation RF | Validation NB | Validation XGB | Validation LR | Test RF | Test NB | Test XGB | Test LR |
|---|---|---|---|---|---|---|---|---|
| All features | | 0.8713 | 0.8917 | 0.8641 | | 0.8713 | 0.8921 | 0.8642 |
| Base + DDIs-N + SNPs + DDIs-D | 0.8964 | 0.8575 | 0.8834 | 0.8501 | 0.8973 | 0.8575 | 0.8835 | 0.8501 |
| Base + SE-AH + SNPs + DDIs-D | 0.8961 | 0.8710 | 0.8896 | 0.8655 | 0.8970 | 0.8710 | 0.8896 | 0.8656 |
| Base + SE-AH + DDIs-N + DDIs-D | 0.8947 | 0.8645 | 0.8859 | 0.8550 | 0.8959 | 0.8644 | 0.8863 | 0.8550 |
| Base + SNPs + DDIs-N + SE-AH | 0.8940 | 0.8645 | 0.8868 | 0.8563 | 0.8951 | 0.8644 | 0.8870 | 0.8563 |
| Base + SNPs + DDIs-D | 0.8905 | 0.8572 | 0.8809 | 0.8519 | 0.8913 | 0.8572 | 0.8809 | 0.8519 |
| Base + DDIs-D + DDIs-N | 0.8901 | 0.8505 | 0.8773 | 0.8400 | 0.8911 | 0.8505 | 0.8773 | 0.8401 |
| Base + SNPs + DDIs-N | 0.8886 | 0.8496 | 0.8775 | 0.8415 | 0.8897 | 0.8496 | 0.8776 | 0.8414 |
| Base + DDIs-D + SE-AH | 0.8879 | 0.8641 | 0.8830 | 0.8563 | 0.8890 | 0.8640 | 0.8830 | 0.8563 |
| Base + SE-AH + SNPs | 0.8874 | 0.8641 | 0.8840 | 0.8574 | 0.8885 | 0.8690 | 0.8840 | 0.8575 |
| Base + SE-AH + DDIs-N | 0.8849 | 0.8540 | 0.8783 | 0.8422 | 0.8863 | 0.8539 | 0.8785 | 0.8421 |
| Base + DDIs-D | 0.8812 | 0.8502 | 0.8736 | 0.8432 | 0.8820 | 0.8502 | 0.8736 | 0.8431 |
| Base + SNPs | 0.8798 | 0.8493 | 0.8744 | 0.8440 | 0.8810 | 0.8492 | 0.8744 | 0.8440 |
| Base + DDIs-N | 0.8788 | 0.8389 | 0.8681 | 0.8281 | 0.8802 | 0.8389 | 0.8683 | 0.8279 |
| Base + SE-AH | 0.8761 | 0.8535 | 0.8740 | 0.8430 | 0.8774 | 0.8533 | 0.8741 | 0.8429 |
| Base | 0.8659 | 0.8385 | 0.8634 | 0.8313 | 0.8673 | 0.8385 | 0.8636 | 0.8310 |
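For readers unfamiliar with how an averaged AUC is obtained, here is a minimal sketch. It uses the rank-based (Mann-Whitney) definition of AUC and averages it over several hold-out runs. The scores and labels are made up, and the function names are hypothetical; the study's own evaluation code is not shown in this excerpt.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(runs):
    """Average AUC over several hold-out runs, each a (labels, scores) pair."""
    return sum(auc(y, s) for y, s in runs) / len(runs)


# Two hypothetical hold-out runs with made-up classifier scores.
runs = [
    ([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]),
    ([1, 0, 1, 0], [0.8, 0.3, 0.6, 0.6]),
]
print(mean_auc(runs))  # prints: 0.8125
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; on sets the size of those used here, a sort-based implementation would be the practical choice.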
Performance measurements using all features.
| | Specificity | Precision | Recall | F1 |
|---|---|---|---|---|
| Validation set | | | | |
| RF | 0.8113 | 0.8167 | 0.8407 | 0.8285 |
| NB | 0.7521 | 0.7655 | 0.8245 | 0.7957 |
| LR | 0.7913 | 0.7934 | 0.8016 | 0.7975 |
| XGB | 0.8161 | 0.8169 | 0.8207 | 0.8188 |
| Test set | | | | |
| RF | 0.8126 | 0.8178 | 0.8473 | 0.8294 |
| NB | 0.7514 | 0.7682 | 0.8240 | 0.7951 |
| LR | 0.7912 | 0.7933 | 0.8014 | 0.7973 |
| XGB | 0.8144 | 0.7154 | 0.8196 | 0.8175 |
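The four measures in this table follow the standard confusion-matrix definitions. The sketch below computes them from raw counts; the counts are hypothetical and the function names are introduced here for illustration only.

```python
def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp, fp, tn, fn):
    """Specificity, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Hypothetical counts for a balanced set of 100 positives and 100 negatives.
m = binary_metrics(tp=84, fp=19, tn=81, fn=16)
print({k: round(v, 4) for k, v in m.items()})

# Consistency check against the validation-set RF row of the table.
print(round(f1_from(0.8167, 0.8407), 4))  # prints: 0.8285
```

Because the reported values are averages over repeated runs, the harmonic mean of the averaged precision and recall need not match the averaged F1 exactly, although it does for the validation-set RF row.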
Representation of feature importance by ranking.
| | DDIs-D | SNPs | Indication | DDIs-N | Target | Chemical | SE-AH |
|---|---|---|---|---|---|---|---|
| RF | 1 | 2 | 4 | 6 | 5 | 3 | 7 |
| LR | 4 | 3 | 5 | 7 | 6 | 2 | 1 |
| XGB | 2 | 3 | 5 | 7 | 1 | 4 | 6 |
| NB | 1 | 2 | 5 | 6 | 4 | 3 | 7 |
Figure 5. Feature importance of RF.
Contingency table for FAERS and predictions from proposed method.
| Predictions / FAERS | True | False |
|---|---|---|
| True | 12,648 | 7,797 |
| False | 41,446 | 53,780 |
Contingency table for MedEffect and predictions from proposed method.
| Predictions / MedEffect | True | False |
|---|---|---|
| True | 377 | 1,145 |
| False | 4,089 | 15,233 |
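One common way to check whether predictions and an external database agree more often than chance is to compute an odds ratio and Pearson's chi-square statistic on the 2x2 table. This excerpt does not say which test the authors applied, so the sketch below is an assumption about a reasonable analysis, not a reproduction of theirs. Both quantities are unchanged when the table is transposed, so the result does not depend on which axis holds the predictions.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CRITICAL_95 = 3.841

# Cell counts copied from the two contingency tables above.
tables = {
    "FAERS": (12_648, 7_797, 41_446, 53_780),
    "MedEffect": (377, 1_145, 4_089, 15_233),
}
for name, cells in tables.items():
    stat = chi_square_2x2(*cells)
    print(name, round(odds_ratio(*cells), 3), round(stat, 1), stat > CRITICAL_95)
```

For these counts, both tables give an odds ratio above 1 and a statistic above the 3.841 threshold, which is what one would expect if predicted pairs are enriched among reported ones.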
Comparison with the method of Zhao et al.
| | AUC | Specificity | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Proposed method (stacking) | | 0.8104 | 0.8174 | 0.8483 | 0.8325 |
| Proposed method (RF) | | 0.8126 | 0.8178 | 0.8473 | 0.8294 |
| Zhao et al. method (2019) | 0.8977 | 0.880 | 0.760 | 0.761 | 0.761 |
| Zhao et al. method (2018) | 0.8492 | 0.759 | 0.766 | 0.791 | 0.778 |