Vishnu Sripriya Akondi, Vineetha Menon, Jerome Baudry, Jana Whittle.
Abstract
Most contemporary drug discovery projects start with a 'hit discovery' phase, in which small chemicals are identified that have the capacity to interact, in a chemical sense, with a protein target involved in a given disease. To assist and accelerate this initial drug discovery process, 'virtual docking calculations' are routinely performed, in which computational models of proteins and of small chemicals are evaluated for their capacity to bind together. In cutting-edge implementations of this process, several conformations of a protein target are independently assayed in parallel 'ensemble docking' calculations. A minority of these protein conformations will be capable of binding many chemicals, while the majority will not. The fact that only some of the conformations accessible to a protein will be 'selected' by chemicals is known in biology as the 'conformational selection' process. This work describes a machine learning approach to characterize and identify the properties of the protein conformations that will be selected by (i.e., bind to) chemicals and are therefore classified as potential binding drug candidates, as opposed to the remaining, non-binding protein conformations. This work also addresses the class imbalance problem through advanced machine learning techniques that maximize the prediction rate of potential protein molecular conformations for the test case proteins ADORA2A (Adenosine A2a Receptor) and OPRK1 (Opioid Receptor Kappa 1), which in turn reduces failure rates and hastens the drug discovery process.
Keywords: ADORA2A; OPRK1; class imbalance; drug candidates; drug discovery; machine learning; protein conformation selection
Year: 2022 PMID: 35163865 PMCID: PMC8840520 DOI: 10.3390/molecules27030594
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Protein descriptors for ADORA2A and OPRK1 datasets.
| Protein Property | Description |
|---|---|
| pro_mass | Protein Mass |
| pro_pI_3D | Structure-based pI Prediction |
| pro_coeff_fric | Frictional Coefficient |
| pro_coeff_diff | Diffusion coefficient |
| pro_r_gyr | Radius of Gyration |
| pro_r_solv | Hydrodynamic Radius |
| pro_sed_const | Sedimentation Constant |
| pro_eccen | Protein Eccentricity |
| pro_asa_vdw | Water Accessible Surface Area |
| pro_asa_hyd | Hydrophobic Surface Area |
| pro_asa_hph | Hydrophilic Surface Area |
| pro_volume | Protein Volume |
| pro_mobility | Protein Mobility |
| pro_helicity | Protein Helix Ratio |
| pro_henry | Henry’s Function f(ka) |
| pro_net_charge | Protein Net Charge |
| pro_app_charge | Protein Charge at Debye Length |
| pro_dipole_moment | Protein Dipole Moment |
| pro_hyd_moment | Hydrophobicity moment |
| pro_zeta | Zeta Potential |
| pro_zdipole | Zeta Dipole Moment |
| pro_zquadrupole | Zeta Quadrupole Moment |
| pro_patch_hyd | Area of hydrophobic protein patch(es) |
| pro_patch_hyd_1 | Area of largest hydrophobic protein patch(es) |
| pro_patch_hyd_2 | Area of 2 largest hydrophobic protein patch(es) |
| pro_patch_hyd_3 | Area of 3 largest hydrophobic protein patch(es) |
| pro_patch_hyd_4 | Area of 4 largest hydrophobic protein patch(es) |
| pro_patch_hyd_5 | Area of 5 largest hydrophobic protein patch(es) |
| pro_patch_hyd_n | Count of hydrophobic protein patch(es) |
| pro_patch_ion | Area of ionic protein patch(es) |
| pro_patch_ion_1 | Area of largest ionic protein patch(es) |
| pro_patch_ion_2 | Area of 2 largest ionic protein patch(es) |
| pro_patch_ion_3 | Area of 3 largest ionic protein patch(es) |
| pro_patch_ion_4 | Area of 4 largest ionic protein patch(es) |
| pro_patch_ion_5 | Area of 5 largest ionic protein patch(es) |
| pro_patch_ion_n | Count of ionic protein patch(es) |
| pro_patch_neg | Area of negative protein patch(es) |
| pro_patch_neg_1 | Area of largest negative protein patch(es) |
| pro_patch_neg_2 | Area of 2 largest negative protein patch(es) |
| pro_patch_neg_3 | Area of 3 largest negative protein patch(es) |
| pro_patch_neg_4 | Area of 4 largest negative protein patch(es) |
| pro_patch_neg_5 | Area of 5 largest negative protein patch(es) |
| pro_patch_neg_n | Count of negative protein patch(es) |
| pro_patch_pos | Area of positive protein patch(es) |
| pro_patch_pos_1 | Area of largest positive protein patch(es) |
| pro_patch_pos_2 | Area of 2 largest positive protein patch(es) |
| pro_patch_pos_3 | Area of 3 largest positive protein patch(es) |
| pro_patch_pos_4 | Area of 4 largest positive protein patch(es) |
| pro_patch_pos_5 | Area of 5 largest positive protein patch(es) |
| pro_patch_pos_n | Count of positive protein patch(es) |
Figure 1. Flowchart of the proposed two-stage sampling-based classifier.
Number of class 0 samples and class 1 samples in the original training dataset.
| Training Size (%) | Class 0 Training | Class 0 Testing | Class 1 Training | Class 1 Testing |
|---|---|---|---|---|
| 10 | 202 | 1945 | 97 | 753 |
| 20 | 413 | 1734 | 186 | 664 |
| 30 | 628 | 1519 | 271 | 579 |
Classification performance of Methodology 1-LR on the original training dataset.
| Training Size (%) | TP | TN | FN | FP | Accuracy (%) | TP Rate | TN Rate |
|---|---|---|---|---|---|---|---|
| Methodology 1 Classifier—LR | | | | | | | |
| 10 | 222 | 1577 | 531 | 368 | 66.7 | 0.29 | 0.81 |
| 20 | 188 | 1449 | 476 | 285 | 68.2 | 0.28 | 0.83 |
| 30 | 151 | 1313 | 428 | 206 | 69.7 | 0.26 | 0.864 |
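
The accuracy and per-class rates in the table above are consistent with reading the four count columns as TP, TN, FN, and FP (for the 10% row, TP + FN equals the 753 class 1 test samples and TN + FP equals the 1945 class 0 test samples). A minimal sketch of that arithmetic, with the function name `confusion_metrics` chosen here for illustration only:

```python
def confusion_metrics(tp, tn, fn, fp):
    """Return (accuracy in %, TP rate, TN rate) from confusion counts."""
    total = tp + tn + fn + fp
    accuracy = 100.0 * (tp + tn) / total
    tp_rate = tp / (tp + fn)  # fraction of class 1 test samples recovered
    tn_rate = tn / (tn + fp)  # fraction of class 0 test samples recovered
    return accuracy, tp_rate, tn_rate


# 10% training row of the LR table for ADORA2A: 222, 1577, 531, 368
acc, tpr, tnr = confusion_metrics(tp=222, tn=1577, fn=531, fp=368)
print(f"{acc:.1f} {tpr:.2f} {tnr:.2f}")  # prints: 66.7 0.29 0.81
```

The same function can be applied to any other row of the classification tables to cross-check the reported rates.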
Classification performance of Methodology 2—GB and KNN on the original training dataset.
| Training Size (%) | TP | TN | FN | FP | Accuracy (%) | TP Rate | TN Rate |
|---|---|---|---|---|---|---|---|
| Methodology 2—GB | | | | | | | |
| 10 | 463 | 1185 | 290 | 760 | 61.08 | 0.614 | 0.6 |
| 20 | 331 | 1165 | 333 | 569 | 62.3 | 0.49 | 0.67 |
| 30 | 279 | 1059 | 300 | 460 | 63.7 | 0.48 | 0.69 |
| Methodology 2—KNN | | | | | | | |
| 10 | 224 | 1578 | 529 | 367 | 66.7 | 0.29 | 0.81 |
| 20 | 188 | 1424 | 476 | 310 | 67.2 | 0.28 | 0.82 |
| 30 | 149 | 1245 | 430 | 274 | 66.4 | 0.25 | 0.81 |
Number of class 0 samples and class 1 samples in the new training dataset.
| Training Size (%) | Class 0 Training | Class 0 Testing | Class 1 Training | Class 1 Testing |
|---|---|---|---|---|
| 10 | 97 | 1945 | 202 | 753 |
| 20 | 186 | 1734 | 413 | 664 |
| 30 | 271 | 1519 | 628 | 579 |
Classification performance of SMOTE-GB and SMOTE-KNN on the new training dataset.
| Training Size (%) | TP | TN | FN | FP | Accuracy (%) | TP Rate | TN Rate |
|---|---|---|---|---|---|---|---|
| Methodology 2 Classifier—SMOTE-GB | | | | | | | |
| 10 | 479 | 1114 | 274 | 831 | 59.04 | 0.636 | 0.572 |
| 20 | 420 | 943 | 244 | 791 | 56.8 | 0.63 | 0.54 |
| 30 | 384 | 800 | 195 | 719 | 56.4 | 0.66 | 0.52 |
| Methodology 2 Classifier—SMOTE-KNN | | | | | | | |
| 10 | 632 | 504 | 121 | 1441 | 42.1 | 0.839 | 0.25 |
| 20 | 592 | 408 | 72 | 1326 | 41.7 | 0.89 | 0.23 |
| 30 | 504 | 373 | 75 | 1146 | 41.8 | 0.87 | 0.24 |
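
SMOTE, the oversampling step named in the table above, creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbors. The sketch below illustrates that general idea in plain Python; the neighbor count `k`, the seed, and the toy 2-D points are hypothetical and are not the configuration or data used in this work.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


# Hypothetical 2-D minority samples (illustration only, not data from the paper)
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_new=6)
print(len(new_points))  # prints: 6
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region already occupied by that class.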
Reconfirmation and identification of new TP by Methodology 2—SMOTE-GB and SMOTE-KNN.
| Training Size (%) | New TP | New TN | Reconfirmed TP |
|---|---|---|---|
| Methodology 2—SMOTE-GB | | | |
| 10 | 298 | 84 | 181 |
| 20 | 262 | 46 | 158 |
| 30 | 250 | 22 | 134 |
| Methodology 2—SMOTE-KNN | | | |
| 10 | 427 | 28 | 205 |
| 20 | 413 | 26 | 179 |
| 30 | 364 | 11 | 140 |
Decision fusion of Methodology 1 and Methodology 2: LR+SMOTE-GB and LR+SMOTE-KNN.
| Training Size (%) | Total Accuracy (%) | TP Accuracy (%) | TN Accuracy (%) |
|---|---|---|---|
| Decision Fusion: LR+SMOTE-GB | | | |
| 10 | 80.8 | 69 | 85.3 |
| 20 | 81.1 | 67.7 | 86.2 |
| 30 | 82.7 | 69.2 | 87.8 |
| Decision Fusion: LR+SMOTE-KNN | | | |
| 10 | 83.5 | 86.1 | 82.5 |
| 20 | 86.5 | 90.5 | 85 |
| 30 | 87.6 | 88.9 | 87.1 |
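
The fused accuracies above appear consistent with adding the correct predictions newly identified by Methodology 2 to those already made by LR. The sketch below checks this reading for the 10% LR+SMOTE-GB row. Treating the first two values of the reconfirmation table (298 and 84) as newly identified TP and TN is an assumption inferred from the arithmetic, not something stated in the text, and the function name is illustrative.

```python
def fused_accuracies(lr_tp, lr_tn, new_tp, new_tn, n_class1, n_class0):
    """Return (total, TP, TN) accuracies in % after adding new correct calls."""
    tp = lr_tp + new_tp
    tn = lr_tn + new_tn
    total = 100.0 * (tp + tn) / (n_class1 + n_class0)
    return total, 100.0 * tp / n_class1, 100.0 * tn / n_class0


# ADORA2A, 10% training size: LR gave TP=222 and TN=1577; the SMOTE-GB
# reconfirmation row lists 298 and 84 (assumed here to be new TP and new TN);
# the test set holds 753 class 1 and 1945 class 0 samples.
total, tp_acc, tn_acc = fused_accuracies(222, 1577, 298, 84, 753, 1945)
print(f"{total:.1f} {tp_acc:.1f} {tn_acc:.1f}")  # prints: 80.8 69.1 85.4
```

These values agree with the tabulated 80.8, 69, and 85.3 to within about 0.1, which supports the reading but does not prove it.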
Figure 2. Plot of overall classification accuracies across varying training sizes for case 1, case 2, and decision fusion for protein ADORA2A.
Figure 3. Plot of TP accuracies across varying training sizes for case 1, case 2, and decision fusion for protein ADORA2A.
Figure 4. Plot of TN accuracies across varying training sizes for case 1, case 2, and decision fusion for protein ADORA2A.
Number of class 0 samples and class 1 samples in the original training dataset.
| Training Size (%) | Class 0 Training | Class 0 Testing | Class 1 Training | Class 1 Testing |
|---|---|---|---|---|
| 10 | 289 | 2573 | 10 | 127 |
| 20 | 574 | 2288 | 25 | 112 |
| 30 | 858 | 2004 | 41 | 96 |
Classification performance of Methodology 1-LR on the original training dataset.
| Training Size (%) | TP | TN | FN | FP | Accuracy (%) | TP Rate | TN Rate |
|---|---|---|---|---|---|---|---|
| Methodology 1 Classifier—LR | | | | | | | |
| 10 | 3 | 2531 | 124 | 42 | 93.8 | 0.02 | 0.98 |
| 20 | 1 | 2282 | 111 | 6 | 95.1 | 0 | 0.99 |
| 30 | 0 | 2001 | 96 | 3 | 95.2 | 0 | 0.99 |
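
For OPRK1, the high overall accuracy of LR coexists with a TP rate near zero, which is the class imbalance problem expressed in numbers. As an illustration only (balanced accuracy is not a metric reported in this work), averaging the two per-class rates for the 30% row makes the gap visible. The counts are read from the table as TP, TN, FN, and FP.

```python
def balanced_accuracy(tp, tn, fn, fp):
    """Mean of the TP rate and TN rate; 0.5 corresponds to chance level."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# OPRK1, LR, 30% training size: TP=0, TN=2001, FN=96, FP=3
tp, tn, fn, fp = 0, 2001, 96, 3
accuracy = 100.0 * (tp + tn) / (tp + tn + fn + fp)
balanced = balanced_accuracy(tp, tn, fn, fp)
print(f"accuracy={accuracy:.1f}% balanced={balanced:.3f}")
# prints: accuracy=95.3% balanced=0.499
```

A classifier that labels every conformation as class 0 would score almost identically, so overall accuracy alone says little about how well binding conformations are found.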
Classification performance of Methodology 2—GB and KNN on the original training dataset.
| Training Size (%) | TP | TN | FN | FP | Accuracy (%) | TP Rate | TN Rate |
|---|---|---|---|---|---|---|---|
| Methodology 2—GB | | | | | | | |
| 10 | 27 | 1983 | 100 | 590 | 74.44 | 0.21 | 0.77 |
| 20 | 45 | 1186 | 67 | 1102 | 51.2 | 0.4 | 0.51 |
| 30 | 51 | 804 | 45 | 1200 | 40.7 | 0.5 | 0.40 |
| Methodology 2—KNN | | | | | | | |
| 10 | 0 | 2573 | 127 | 0 | 95.2 | 0 | 1 |
| 20 | 0 | 2287 | 112 | 1 | 95.2 | 0 | 0.99 |
| 30 | 0 | 2004 | 96 | 0 | 95.4 | 0 | 1 |
Number of class 0 samples and class 1 samples in the new training dataset.
| Training Size (%) | Class 0 Training | Class 0 Testing | Class 1 Training | Class 1 Testing |
|---|---|---|---|---|
| 10 | 10 | 2573 | 289 | 127 |
| 20 | 25 | 2288 | 574 | 112 |
| 30 | 41 | 2004 | 858 | 96 |
Classification performance of SMOTE-GB and SMOTE-KNN on the new training dataset.
| Training Size (%) | TP | TN | FN | FP | Accuracy (%) | TP Rate | TN Rate |
|---|---|---|---|---|---|---|---|
| Methodology 2 Classifier—SMOTE-GB | | | | | | | |
| 10 | 59 | 1420 | 68 | 1153 | 54.7 | 0.46 | 0.55 |
| 20 | 81 | 498 | 31 | 1790 | 24.1 | 0.72 | 0.21 |
| 30 | 61 | 602 | 35 | 1402 | 31.5 | 0.63 | 0.30 |
| Methodology 2 Classifier—SMOTE-KNN | | | | | | | |
| 10 | 125 | 68 | 2 | 2505 | 7 | 0.98 | 0.02 |
| 20 | 109 | 60 | 3 | 2228 | 7 | 0.97 | 0.02 |
| 30 | 93 | 48 | 3 | 1956 | 6.7 | 0.96 | 0.02 |
Reconfirmation and identification of new TP by Methodology 2—SMOTE-GB and SMOTE-KNN.
| Training Size (%) | New TP | New TN | Reconfirmed TP |
|---|---|---|---|
| Methodology 2—SMOTE-GB | | | |
| 10 | 56 | 13 | 3 |
| 20 | 80 | 0 | 1 |
| 30 | 61 | 2 | 0 |
| Methodology 2—SMOTE-KNN | | | |
| 10 | 122 | 0 | 3 |
| 20 | 108 | 1 | 1 |
| 30 | 93 | 0 | 0 |
Decision fusion of Methodology 1 and Methodology 2: LR+SMOTE-GB and LR+SMOTE-KNN.
| Training Size (%) | Total Accuracy (%) | TP Accuracy (%) | TN Accuracy (%) |
|---|---|---|---|
| Decision Fusion: LR+SMOTE-GB | | | |
| 10 | 96.4 | 46.4 | 98.8 |
| 20 | 98.4 | 72.3 | 99.7 |
| 30 | 98.2 | 63.5 | 99.9 |
| Decision Fusion: LR+SMOTE-KNN | | | |
| 10 | 98.3 | 98.4 | 98.3 |
| 20 | 99.6 | 97.3 | 99.7 |
| 30 | 99.7 | 96.8 | 99.8 |
Figure 5. Plot of overall classification accuracies across varying training sizes for case 1, case 2, and decision fusion for protein OPRK1.
Figure 6. Plot of TP accuracies across varying training sizes for case 1, case 2, and decision fusion for protein OPRK1.
Figure 7. Plot of TN accuracies across varying training sizes for case 1, case 2, and decision fusion for protein OPRK1.
Figure 8. Plot of ROC curve for ADORA2A protein for a training size of 30%.
AUC Score and F1 Score of the proposed ML methodologies for ADORA2A.
| Proposed ML Methodologies | AUC Score | F1 Score |
|---|---|---|
| LR | 0.649 | 0.323 |
| SMOTE-GB | 0.638 | 0.456 |
| SMOTE-KNN | 0.588 | 0.452 |
| LR+SMOTE-GB | 0.638 | 0.684 |
| LR+SMOTE-KNN | 0.590 | 0.792 |
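
The F1 scores of the single classifiers in the table above can be reproduced from the 30% training rows of the ADORA2A classification tables, reading their count columns as TP, TN, FN, and FP. A minimal sketch using the standard definition of F1 (the function name is illustrative):

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


# ADORA2A, 30% training size, counts taken from the LR and SMOTE-KNN tables
f1_lr = f1_score(tp=151, fp=206, fn=428)
f1_knn = f1_score(tp=504, fp=1146, fn=75)
print(f"{f1_lr:.3f} {f1_knn:.3f}")  # prints: 0.323 0.452
```

Both values match the LR and SMOTE-KNN entries of the table, which suggests the reported F1 scores refer to the 30% training size.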
Figure 9. Plot of ROC curve for OPRK1 protein for a training size of 30%.
AUC Score and F1 score of the proposed ML methodologies for OPRK1.
| Proposed ML Methodologies | AUC Score | F1 Score |
|---|---|---|
| LR | 0.496 | 0.030 |
| SMOTE-GB | 0.475 | 0.078 |
| SMOTE-KNN | 0.470 | 0.086 |
| LR+SMOTE-GB | 0.475 | 0.770 |
| LR+SMOTE-KNN | 0.470 | 0.968 |