Othman Soufan1, Wail Ba-alawi1, Moataz Afeef1, Magbubah Essack1, Valentin Rodionov2, Panos Kalnis3, Vladimir B Bajic1.
Abstract
High-throughput screening (HTS) experiments provide a valuable resource that reports the biological activity of numerous chemical compounds relative to their molecular targets. Building computational models that accurately predict such activity status (active vs. inactive) in specific assays is a challenging task given the large volume of data and the frequently small proportion of active compounds relative to inactive ones. We developed a method, DRAMOTE, to predict the activity status of chemical compounds in HTP activity assays. For a class of HTP assays, our method achieves considerably better results than the current state-of-the-art solutions. We achieved this by modifying a minority oversampling technique. To demonstrate that DRAMOTE performs better than the other methods, we performed a comprehensive comparison with several other methods and evaluated them on data from 11 PubChem assays through 1,350 experiments that involved approximately 500,000 interactions between chemicals and their target proteins. As an example of potential use, we applied DRAMOTE to develop robust models for predicting FDA-approved drugs that have a high probability of interacting with the thyroid stimulating hormone receptor (TSHR) in humans. Our findings are further partially and indirectly supported by 3D docking results and literature information. The results based on approximately 500,000 interactions suggest that DRAMOTE performed best and that it can be used for developing robust virtual screening models. The datasets and implementation of all solutions are available as a MATLAB toolbox online at www.cbrc.kaust.edu.sa/dramote and can be found on Figshare.
Year: 2015 PMID: 26658480 PMCID: PMC4682830 DOI: 10.1371/journal.pone.0144426
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of experimental datasets including reference IDs in PubChem Database.
| Dataset | Target Name (Target Type) | Type of interacting compounds | Minority Class Size | Majority Class Size | Imbalance Ratio (IR) |
|---|---|---|---|---|---|
| BenchSet (AID: 773, AID: 1006 and AID: 1379) | Luciferase [Photuris pennsylvanica] (Protein) | Inhibitors | 487 | 184,154 | 1:377 |
| AID 596 | Microtubule-associated protein tau [Homo sapiens] (Protein) | Binders | 1,391 | 66,726 | 1:48 |
| AID 618 | Matrix metalloproteinase 1, partial [Homo sapiens] (Protein) | Inhibitors | 537 | 86,197 | 1:160 |
| AID 644 | Rho-associated protein kinase 2 [Homo sapiens] (Protein) | Inhibitors | 67 | 139 | 1:2 |
| AID 886 | Chain B, The Structure Of Wild-Type Human Hadh2 (Protein) | Inhibitors | 2,463 | 64,616 | 1:26 |
| AID 899 | Cytochrome P450 2C19 precursor [Homo sapiens] (Protein) | Inhibitors and Substrates | 1,901 | 6,443 | 1:3 |
| AID 938 | Thyroid stimulating hormone receptor [Homo sapiens] (Protein) | Agonist Activators | 1,794 | 60,806 | 1:34 |
| AID 743042 | Androgen receptor [Homo sapiens] (Protein) | Antagonist Activators | 674 | 6,939 | 1:10 |
| AID 743288 | Hek293 cell line (Cell) | Binders | 95 | 2,128 | 1:22 |
| Total Interactions | | | | | |
Fig 1. Illustration of generating synthetic instances.
A) SMOTE generates the light blue samples by interpolating between a randomly chosen minority sample and its k-nearest neighbors. B) DRAMOTE generates the light blue samples by choosing a minority sample based on its importance (i.e., its contribution to precision) and the direction towards a safe region. A minority sample (red) that lies very close to the majority (negative) samples, shown as circles, will probably be misclassified as negative and should therefore get more support than the green minority samples. Once a minority sample is chosen, a second point must be chosen for interpolation. The direction of interpolation can be controlled by choosing a nearest neighbor that does not overlap with the negative class. This, in turn, provides support for the red point without harming classifier performance in its surrounding region.
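The interpolation step described for panel A can be sketched in a few lines. This is a minimal illustration of plain SMOTE-style interpolation only, not the DRAMOTE toolbox code: the function name `smote_samples`, its parameters, and the toy points are assumptions made for this example, and the importance-based sample selection and safe-direction choice of panel B are not implemented.

```python
import math
import random


def smote_samples(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation (illustrative)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a minority sample uniformly at random (plain SMOTE behaviour).
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours, excluding the sample itself.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class: the four corners of the unit square (hypothetical data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_samples(minority, n_new=5, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, it stays inside the region those samples span. Which sample and which neighbor are chosen is exactly what the caption says DRAMOTE changes.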
Comparison of the data preprocessing methods.
Larger standard deviation values are the result of averaging over different types of classifiers in this summary table.
| Dataset | Method | Sensitivity % | Precision % | F1 Score % | F0.5 Score % |
|---|---|---|---|---|---|
| | RU | 85.67 (±2.5) | 1.07 (±0.29) | 2.93 (±0.56) | 1.33 (±0.35) |
| | GSVM-RU | 68.53 (±6) | 2.73 (±2.05) | 5.13 (±3.7) | 3.36 (±2.49) |
| | SMOTE | 62.79 (±15.32) | 10.44 (±16.11) | | 10.87 (±15.17) |
| | MWMOTE | 69.49 (±13.18) | 4.9 (±6.7) | 7.87 (±9.08) | 5.75 (±7.52) |
| | DRAMOTE | 58.14 (±19.2) | | 11.62 (±11.42) | |
| | [ | | 5 | NA | NA |
| | RU | 75.9 (±3.04) | 5.3 (±1.17) | 9.89 (±2.07) | 6.51 (±1.41) |
| | GSVM-RU | | 4.56 (±2.78) | 8.46 (±4.78) | 5.59 (±3.34) |
| | SMOTE | 64.02 (±13.8) | 10.9 (±8.95) | 16.38 (±9.28) | 12.47 (±9.16) |
| | MWMOTE | 62.1 (±14.3) | 10.8 (±9.2) | 16.11 (±9.34) | 12.32 (±9.37) |
| | DRAMOTE | 42.9 (±13.52) | | | |
| | RU | 72.54 (±3.41) | 1.38 (±0.31) | 2.7 (±0.59) | 1.71 (±0.38) |
| | GSVM-RU | | 2.64 (±1.48) | 4.89 (±2.59) | 3.24 (±1.79) |
| | SMOTE | 43.01 (±17.87) | 10.07 (±12.36) | 10.93 (±8.36) | 10.01 (±10.42) |
| | MWMOTE | 42.34 (±18.53) | 10.31 (±12.72) | | 10.24 (±10.49) |
| | DRAMOTE | 29.69 (±15.26) | | 9.73 (±6.09) | |
| | RU | 50.29 (±4.46) | 35.08 (±2.56) | 40.32 (±3.1) | 37.3 (±2.49) |
| | GSVM-RU | | 36.02 (±2.51) | | 39.84 (±2.48) |
| | SMOTE | 47.3 (±14.1) | 41.78 (±7.23) | 40.95 (±3.21) | 41.65 (±3.72) |
| | MWMOTE | 47.37 (±12.37) | 42.22 (±6.68) | 41.99 (±3.24) | |
| | DRAMOTE | 40.09 (±8.51) | | 38.84 (±1.64) | 41.49 (±5.79) |
| | RU | | 67.65 (±2.55) | 80.52 (±1.75) | 72.27 (±2.31) |
| | GSVM-RU | 99.25 (±0.97) | 54.51 (±26.52) | 65.87 (±29.63) | 58.53 (±27.76) |
| | SMOTE | 96.94 (±4.11) | 75.2 (±4.92) | | |
| | MWMOTE | 97.03 (±3.27) | 74.32 (±4.81) | 83.98 (±2.75) | 77.9 (±4.06) |
| | DRAMOTE | 94.38 (±8.1) | | 83.55 (±3.72) | 78.56 (±4.17) |
| | RU | 77.65 (±3.43) | 45.96 (±7.07) | 57.33 (±5.46) | 49.89 (±6.7) |
| | GSVM-RU | | 25.82 (±2.6) | 40.69 (±2.84) | 30.25 (±2.76) |
| | SMOTE | 70.44 (±8.14) | 53.52 (±14.02) | | |
| | MWMOTE | 70.5 (±8.48) | 52.61 (±13.66) | 58.55 (±6.55) | 54.55 (±10.83) |
| | DRAMOTE | 64.51 (±8.01) | | 56.73 (±5.38) | 54.47 (±10.69) |
| | RU | | 66.17 (±2) | 79.4 (±1.45) | 37.3 (±2.49) |
| | GSVM-RU | 99.16 (±0.5) | 45.85 (±17.01) | 56.79 (±17.22) | 49.64 (±24.09) |
| | SMOTE | 91.86 (±0.9) | 80.05 (±1.8) | 84 (±1.34) | 81.94 (±11.11) |
| | MWMOTE | 94.49 (±8.2) | 70.7 (±8) | 80.74 (±1.9) | 74.41 (±6.24) |
| | DRAMOTE | 91.39 (±4) | | | |
| | RU | 71.34 (±7.44) | 17.22 (±2.83) | 27.66 (±4) | 20.28 (±3.21) |
| | GSVM-RU | | 11.11 (±0.65) | 19.81 (±0.9) | 13.47 (±0.74) |
| | SMOTE | 33.38 (±16.32) | 36.99 (±21.61) | 27.71 (±8.52) | 29.84 (±10.97) |
| | MWMOTE | 35.52 (±14.9) | 36.54 (±18.4) | 30.56 (±7.01) | 32.18 (±9.78) |
| | DRAMOTE | 35.38 (±14.13) | | | |
| | RU | 68.09 (±5.53) | 8.38 (±1.07) | 14.89 (±1.77) | 10.16 (±1.27) |
| | GSVM-RU | | 5.76 (±0.4) | 10.78 (±0.68) | 7.08 (±0.48) |
| | SMOTE | 25.74 (±18.34) | 26.99 (±23.95) | 24.56 (±6.5) | 24.05 (±10.15) |
| | MWMOTE | 23.8 (±17.4) | 33.02 (±21.18) | 23.75 (±9.67) | 25.78 (±10.32) |
| | DRAMOTE | 27.88 (±14.66) | | | |
a NA indicates that a particular measure was not reported in the referenced work
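The F1 and F0.5 columns are instances of the standard F-beta measure, F_β = (1 + β²)·P·R / (β²·P + R), where P is precision and R is sensitivity; β = 0.5 weights precision more heavily than sensitivity. The sketch below shows the calculation for a single classifier. The function name `f_beta` and the example values are illustrative assumptions. Since the table reports averages over several classifiers, applying the formula to the averaged precision and sensitivity columns will not, in general, reproduce the averaged F-score columns.

```python
def f_beta(precision, sensitivity, beta=1.0):
    """F-beta score from precision and sensitivity (recall), both given in [0, 1]."""
    if precision == 0 and sensitivity == 0:
        return 0.0  # convention: no true positives means a score of zero
    b2 = beta ** 2
    return (1 + b2) * precision * sensitivity / (b2 * precision + sensitivity)


# Hypothetical classifier on an imbalanced assay: 10% precision, 60% sensitivity.
f1 = f_beta(0.10, 0.60)             # balances precision and sensitivity
f05 = f_beta(0.10, 0.60, beta=0.5)  # emphasises precision
```

With low precision and high sensitivity, as in this example, F0.5 comes out lower than F1 because it penalizes false positives more strongly.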
Ranking of methods based on F1 Score for every classifier.
| Classifier | RU | GSVM-RU | SMOTE | MWMOTE | DRAMOTE |
|---|---|---|---|---|---|
| SVM-L | 3 | 5 | 1 | 4 | 2 |
| SVM-R | 4 | 5 | 2 | 3 | 1 |
| KNN | 3 | 5 | 2 | 4 | 1 |
| LDA | 4 | 5 | 2 | 3 | 1 |
| NBC | 1 | 4 | 3 | 5 | 2 |
| RF | 4 | 5 | 1 | 3 | 2 |
| Average | 3.17 | 4.83 | 1.83 | 3.67 | 1.5 |
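The Average row is the mean of each method's rank over the six classifiers. As a small sketch, the per-classifier ranks from the table can be averaged column by column; the variable names here are illustrative.

```python
from statistics import mean

methods = ["RU", "GSVM-RU", "SMOTE", "MWMOTE", "DRAMOTE"]

# Ranks copied from the table rows (1 = best F1 Score for that classifier).
ranks = {
    "SVM-L": [3, 5, 1, 4, 2],
    "SVM-R": [4, 5, 2, 3, 1],
    "KNN": [3, 5, 2, 4, 1],
    "LDA": [4, 5, 2, 3, 1],
    "NBC": [1, 4, 3, 5, 2],
    "RF": [4, 5, 1, 3, 2],
}

# Average each column (method) over all classifiers, rounded to two decimals.
average_rank = {
    method: round(mean(row[i] for row in ranks.values()), 2)
    for i, method in enumerate(methods)
}

# Method with the lowest (best) average rank.
best_method = min(average_rank, key=average_rank.get)
```

A lower average rank means the method was closer to the top across classifiers, which is how this table summarizes the per-classifier comparison.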
Top 10 ranked predictions by DRAMOTE for BioAssay 938 with TSHR protein target.
| Rank | DrugBank ID | Drug Name | Description | Ensemble System Score |
|---|---|---|---|---|
| 1 | DB00904 | Ondansetron | Treatment of nausea and vomiting caused by cytotoxic chemotherapy drugs | 0.98 |
| 2 | DB00962 | Zaleplon | Sedative/hypnotic, mainly used for insomnia | 0.97 |
| 3 | DB01349 | Tasosartan | Treat patients with essential hypertension | 0.966 |
| 4 | DB00405 | Dexbrompheniramine | Treat allergic conditions such as hay fever or urticaria | 0.96 |
| 5 | DB01261 | Sitagliptin | Control of type 2 diabetes mellitus | 0.958 |
| 6 | DB06439 | Tyloxapol | Non-ionic detergent often used as a surfactant | 0.957 |
| 7 | DB00889 | Granisetron | Antiemetic and antinauseant for cancer chemotherapy patients | 0.954 |
| 8 | DB01342 | Forasartan | Used alone or with other antihypertensive agents to treat hypertension | 0.953 |
| 9 | DB00748 | Carbinoxamine | First generation antihistamine that competes with free histamine for binding at HA-receptor sites | 0.95 |
| 10 | DB06267 | Udenafil | Treat erectile dysfunction | 0.945 |
Fig 2. Boxplot over free energy of binding and RMSD values for experimental, random and DRAMOTE docking results.
The random set consists of 10 drugs chosen at random from the list of approved drugs in the DrugBank database. The experimental set includes the top 10 drugs as listed in the original BioAssay AID 938 in the PubChem database.