Raquel Rodríguez-Pérez, Tomoyuki Miyao, Swarit Jasial, Martin Vogt, Jürgen Bajorath.
Abstract
Screening of compound libraries against panels of targets yields profiling matrices. Such matrices typically contain structurally diverse screening compounds, large numbers of inactives, and small numbers of hits per assay. As such, they represent interesting and challenging test cases for computational screening and activity predictions. In this work, large compound profiling matrices extracted from publicly available screening data were modeled. Different machine learning methods, including deep learning, were compared, and different prediction strategies were explored. Prediction accuracy varied for assays with different numbers of active compounds, and alternative machine learning approaches often produced comparable results. Deep learning did not further increase the prediction accuracy of standard methods such as random forests or support vector machines. Target-based random forest models were prioritized and yielded successful predictions of active compounds for many assays.
Year: 2018 PMID: 30023899 PMCID: PMC6045364 DOI: 10.1021/acsomega.8b00462
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
Assays and Targets
| assay ID | assay code | target name | organism | # active CPDs (matrix 2 training) | # active CPDs (matrix 2 test) | # active CPDs (matrix 1) |
|---|---|---|---|---|---|---|
| 485313 | A | Niemann-Pick C1 protein precursor | | 3103 | 3142 | 395 |
| 485314 | B | DNA polymerase β | | 1325 | 1326 | 125 |
| 485341 | C | β-lactamase | | 458 | 478 | 420 |
| 485349 | D | serine-protein kinase ATM isoform 1 | | 191 | 175 | 118 |
| 485367 | E | ATP-dependent phosphofructokinase | | 152 | 138 | 103 |
| 504466 | F | ATPase family AAA domain-containing protein 5 | | 1624 | 1586 | 424 |
| 588590 | G | DNA polymerase iota | | 885 | 868 | 103 |
| 588591 | H | DNA polymerase eta | | 1123 | 1129 | 39 |
| 624171 | I | nuclear factor erythroid 2-related factor 2 | | 367 | 391 | 118 |
| 624330 | J | Rac GTPase-activating protein 1 | | 491 | 536 | 156 |
| 1721 | K | pyruvate kinase | | 433 | 425 | 39 |
| 1903 | L | large T antigen | Simian virus 40 | 275 | 248 | 57 |
| 2101 | M | glucocerebrosidase | | 73 | 58 | 41 |
| 2517 | N | AP endonuclease 1 | | 197 | 199 | 32 |
| 2528 | O | Bloom syndrome protein | | 137 | 128 | 8 |
| 2662 | P | histone-lysine | | 10 | 15 | 3 |
| 2676 | Q | relaxin/insulin-like family peptide receptor 1 | | 215 | 195 | 223 |
| 463254 | R | ubiquitin carboxyl-terminal hydrolase 2 isoform a | | 4 | 4 | 2 |
| 485297 | S | Ras-related protein Rab-9A | | 3751 | 3810 | 410 |
| 488837 | T | eyes absent homolog 2 isoform a | | 2 | 7 | 1 |
| 492947 | U | β-2 adrenergic receptor | | 25 | 28 | 4 |
| 504327 | V | histone acetyltransferase KAT2A | | 158 | 141 | 50 |
| 504329 | W | nonstructural protein 1 | influenza A virus | 213 | 205 | 64 |
| 504339 | X | lysine-specific demethylase 4A | | 4755 | 4757 | 1320 |
| 504842 | Y | chaperonin-containing TCP-1 β subunit homolog | | 28 | 20 | 13 |
| 504845 | Z | regulator of G-protein signaling 4 | | 9 | 7 | 1 |
| 504847 | AA | vitamin D3 receptor isoform VDRA | | 772 | 771 | 48 |
| 540317 | AB | chromobox protein homolog 1 | | 442 | 449 | 98 |
| 588579 | AC | DNA polymerase kappa | | 354 | 362 | 6 |
| 588689 | AD | genome polyprotein | dengue virus type 2 | 180 | 184 | 6 |
| 588795 | AE | flap endonuclease 1 | | 175 | 210 | 17 |
| 602179 | AF | isocitrate dehydrogenase 1 | | 75 | 81 | 28 |
| 602233 | AG | phosphoglycerate kinase | | 28 | 40 | 1 |
| 602310 | AH | DNA dC->dU-editing enzyme APOBEC-3G | | 60 | 66 | 11 |
| 602313 | AI | DNA dC->dU-editing enzyme APOBEC-3F isoform a | | 202 | 183 | 28 |
| 602332 | AJ | heat shock 70 kDa protein 5 | | 15 | 15 | 6 |
| 624170 | AK | glutaminase kidney isoform | | 162 | 186 | 65 |
| 624172 | AL | glucagon-like peptide 1 receptor | | 7 | 7 | 2 |
| 624173 | AM | hypothetical protein | | 136 | 141 | 32 |
| 624202 | AN | breast cancer type 1 susceptibility protein | | 1469 | 1484 | 275 |
| 651644 | AO | viral protein r | human immunodeficiency virus 1 | 208 | 209 | 74 |
| 651768 | AP | Werner syndrome ATP-dependent helicase | | 278 | 325 | 5 |
| 652106 | AQ | α-synuclein | | 111 | 102 | 57 |
| 720504 | AR | serine/threonine-protein kinase PLK1 | | 3357 | 3308 | 662 |
| 720542 | AS | apical membrane antigen 1 | | 93 | 98 | 25 |
| 720707 | AT | Rap guanine nucleotide exchange factor 3 | | 50 | 62 | 3 |
| 720711 | AU | Rap guanine nucleotide exchange factor 4 | | 59 | 68 | 16 |
| 743255 | AV | ubiquitin carboxyl-terminal hydrolase 2 isoform a | | 147 | 149 | 15 |
| 743266 | AW | parathyroid hormone 1 receptor | | 66 | 70 | 79 |
| 493005 | AX | Tumor susceptibility gene 101 protein | | 0 | 0 | 0 |
| 504891 | AY | peptidyl-prolyl cis–trans isomerase NIMA-interacting 1 | | 6 | 5 | 0 |
| 504937 | AZ | sphingomyelin phosphodiesterase | | 5 | 9 | 0 |
| 588456 | BA | thioredoxin reductase | Rattus norvegicus | 1 | 8 | 0 |
Reported are the PubChem assay IDs, codes used here, targets, and organisms, for all 53 assays. In addition, for each assay, numbers of active compounds in the matrix 2 training and test sets and in matrix 1 are reported.
Matrix Composition
| | matrix 1 | matrix 2 |
|---|---|---|
| density | 100% | 96.4% |
| # compounds (CPDs) | 109 925 | 143 310 |
| # assays | 53 | 53 |
| percentage of active cells | 0.1% | 0.8% |
| # consistently inactive CPDs | 105 475 (96%) | 110 218 (76.9%) |
| # CPDs with single-target activity | 3639 (3.3%) | 19 069 (13.3%) |
| # CPDs with multitarget activity | 811 (0.7%) | 14 023 (9.8%) |
For matrix 1 and matrix 2, the density, number of compounds and assays, percentage of cells with activity annotations (active cells), number of consistently inactive compounds, and number of compounds with single- and multitarget activity are reported.
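The quantities in this table can be derived from any compound-by-assay matrix. The following is a minimal sketch on a toy matrix, not the authors' code. The encoding (`1` active, `0` inactive, `None` untested), the function name `summarize`, and the choice to divide active cells by all cells rather than by tested cells are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical encoding: one row per compound, one column per assay.
# 1 = active, 0 = inactive, None = compound not tested in that assay.
Matrix = Dict[str, List[Optional[int]]]


def summarize(matrix: Matrix) -> Dict[str, float]:
    """Compute density, active-cell fraction, and compound categories."""
    n_cells = sum(len(row) for row in matrix.values())
    n_tested = sum(1 for row in matrix.values() for v in row if v is not None)
    n_active_cells = sum(1 for row in matrix.values() for v in row if v == 1)

    # Number of assays in which each compound is active.
    actives_per_cpd = [sum(1 for v in row if v == 1) for row in matrix.values()]

    return {
        "density": n_tested / n_cells,
        # Assumption: denominator is all cells, tested or not.
        "active_cell_fraction": n_active_cells / n_cells,
        "consistently_inactive": sum(1 for n in actives_per_cpd if n == 0),
        "single_target": sum(1 for n in actives_per_cpd if n == 1),
        "multitarget": sum(1 for n in actives_per_cpd if n > 1),
    }


toy: Matrix = {
    "cpd1": [0, 0, 0],
    "cpd2": [1, 0, None],
    "cpd3": [1, 1, 0],
    "cpd4": [0, None, 0],
}
summary = summarize(toy)
```

As a consistency check on the reported numbers, the three compound categories partition each matrix: 105 475 + 3639 + 811 = 109 925 for matrix 1 and 110 218 + 19 069 + 14 023 = 143 310 for matrix 2.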
Figure 1. Exemplary active compounds. Shown are exemplary active compounds from two matrix 1 assays for (a) DNA polymerase β (assay code B) and (b) serine-protein kinase ATM isoform 1 (code D), respectively.
Figure 2. Pairwise Tanimoto similarity. The heat map reports mean pairwise Tanimoto similarity for active compounds from matrix 1. The extended connectivity fingerprint with bond diameter 4 (ECFP4; see Materials and Methods) was used as a molecular representation.
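To make the similarity measure concrete, the sketch below computes the Tanimoto coefficient and its mean over all pairs, either within one set of compounds or between two sets. It assumes each fingerprint is already available as a set of "on" feature indices; fingerprint generation itself (ECFP4) is not shown, and the function names and toy feature sets are hypothetical.

```python
from itertools import combinations, product
from typing import FrozenSet, Sequence

Fingerprint = FrozenSet[int]  # assumed representation: indices of set bits


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: |A ∩ B| / |A ∪ B| (0.0 if both are empty)."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def mean_pairwise(fps: Sequence[Fingerprint]) -> float:
    """Mean similarity over all distinct pairs within one compound set."""
    pairs = list(combinations(fps, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def mean_cross(fps_a: Sequence[Fingerprint], fps_b: Sequence[Fingerprint]) -> float:
    """Mean similarity over all pairs drawn from two different compound sets."""
    pairs = list(product(fps_a, fps_b))
    if not pairs:
        raise ValueError("both sets must be non-empty")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy feature sets standing in for the actives of two assays.
assay_1 = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({1, 2, 3})]
assay_2 = [frozenset({7, 8}), frozenset({3, 8})]

intra = mean_pairwise(assay_1)
inter = mean_cross(assay_1, assay_2)
```

In a heat map of this kind, within-set means would typically populate the diagonal and between-set means the off-diagonal cells; whether the figure was built exactly this way is not stated in this record.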
Figure 3. Receiver operating characteristic curves for global models. Receiver operating characteristic (ROC) curves are shown for SVM (red), RF (green), and DNN (blue) global models, which were trained with half of matrix 2 and used to predict the other half of matrix 2 (right) and matrix 1 (left).
Area under the Curve Values for Prediction of 10 Assays of the Matrix 2 Test Set
| assay code | CCBM | NB | RF | SVM | single-task DNN | multitask DNN | GraphConv |
|---|---|---|---|---|---|---|---|
| A | 0.85 | 0.84 | 0.91 | 0.91 | 0.91 | 0.90 | |
| B | 0.77 | 0.79 | 0.82 | 0.82 | 0.83 | ||
| C | 0.64 | 0.71 | 0.72 | 0.69 | 0.67 | 0.72 | |
| D | 0.63 | 0.69 | 0.65 | 0.67 | 0.62 | 0.64 | |
| E | 0.81 | 0.82 | 0.84 | 0.84 | 0.85 | 0.85 | |
| F | 0.82 | 0.82 | 0.87 | 0.87 | 0.86 | ||
| G | 0.73 | 0.79 | 0.81 | 0.79 | 0.82 | ||
| H | 0.80 | 0.85 | 0.88 | 0.87 | 0.89 | ||
| I | 0.80 | 0.85 | 0.88 | 0.85 | |||
| J | 0.84 | 0.87 | 0.91 | 0.86 |
Reported are AUC values for prediction of 10 assays (codes A–J) using different machine learning methods. For each assay, best results are indicated in bold.
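For readers unfamiliar with the metric, the AUC of a ranking can be computed directly from ranks: it equals the probability that a randomly chosen active is scored above a randomly chosen inactive. The following is an illustrative sketch using the rank-sum (Mann-Whitney) formulation with midranks for ties; it is not the evaluation code used in the study, and the labels and scores are toy values.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via the rank-sum statistic; tied scores receive their average rank."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined without both actives and inactives")

    # Sort indices by ascending score and assign 1-based midranks.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks: List[float] = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1

    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    # Subtract the minimum possible rank sum, then normalize by the pair count.
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example: 2 actives, 3 inactives; 5 of the 6 active/inactive pairs
# are ordered correctly.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

An AUC of 0.5 corresponds to random ranking, which helps put values in the 0.6 to 0.7 range reported for some assays into context.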
Area under the Curve Values for Prediction of 10 Assays of Matrix 1
| assay code | CCBM | NB | RF | SVM | single-task DNN | multitask DNN | GraphConv |
|---|---|---|---|---|---|---|---|
| A | 0.88 | 0.86 | 0.93 | 0.93 | 0.93 | 0.92 | |
| B | 0.64 | 0.68 | 0.69 | 0.67 | 0.66 | 0.69 | |
| C | 0.66 | 0.64 | 0.67 | 0.64 | 0.64 | 0.68 | |
| D | 0.62 | 0.63 | 0.62 | 0.62 | 0.63 | 0.60 | |
| E | 0.86 | 0.91 | 0.91 | 0.90 | 0.88 | 0.89 | |
| F | 0.82 | 0.82 | 0.87 | 0.87 | 0.86 | 0.87 | |
| G | 0.55 | 0.55 | 0.58 | 0.57 | 0.54 | 0.57 | |
| H | 0.70 | 0.75 | 0.76 | 0.74 | 0.75 | 0.76 | |
| I | 0.82 | 0.86 | 0.88 | 0.86 | 0.83 | 0.88 | |
| J | 0.84 | 0.88 | 0.93 | 0.93 | 0.90 |
Reported are AUC values for prediction of 10 assays (codes A–J) using different machine learning methods. For each assay, best results are indicated in bold.
Figure 4. Per-target receiver operating characteristic curves. ROC curves are shown for target-based activity predictions with RF (green), multitask DNN (orange), and CCBM (pink) models. Curves represent 10 matrix 1 assays used for method comparisons. Codes A–J designate assays as listed in the Assays and Targets table.
Comparison of Different Molecular Representations
| assay code | MOE | MACCS | MOE + fold. ECFP4 | nonfolded ECFP4 | folded ECFP4 |
|---|---|---|---|---|---|
| A | 0.91 | 0.90 | |||
| B | 0.65 | 0.64 | 0.68 | ||
| C | 0.66 | 0.67 | 0.68 | ||
| D | 0.59 | 0.60 | 0.63 | 0.62 | |
| E | 0.86 | 0.84 | 0.90 | 0.93 | |
| F | 0.86 | 0.84 | |||
| G | 0.58 | 0.56 | 0.57 | 0.58 | |
| H | 0.76 | 0.73 | 0.76 | ||
| I | 0.85 | 0.86 | 0.87 | 0.89 | |
| J | 0.90 | 0.92 |
Reported are AUC values for prediction of 10 assays (codes A–J) in matrix 1 using per-target RF models on the basis of different molecular representations, including 192 two-dimensional (2D) descriptors from the Molecular Operating Environment (MOE), 166 MACCS structural keys, the folded and nonfolded versions of ECFP4, and the combination of MOE descriptors and folded ECFP4 (MOE + fold. ECFP4). For each assay, best results are indicated in bold.
Figure 5. Area under the curve values for per-target models trained with half of matrix 2. AUC values are reported for predictions of compounds active in assays of the matrix 2 test set (blue) and matrix 1 (red).
Figure 6. Area under the curve values for per-target models trained with matrix 2. AUC values are reported for predictions of compounds active in assays of matrix 1.
Recall of Active Compounds in the Top 1% of Ranked Matrix 1
| assay code | # active CPDs in matrix 1 | # active CPDs in top 1% | recall (%) | rank of first active CPD |
|---|---|---|---|---|
| X | 1320 | 383 | 29 | 1 |
| S | 410 | 209 | 51 | 2 |
| A | 395 | 208 | 53 | 1 |
| F | 424 | 161 | 38 | 1 |
| Q | 223 | 120 | 54 | 1 |
| J | 156 | 113 | 72 | 2 |
| AN | 275 | 80 | 29 | 1 |
| E | 103 | 63 | 61 | 1 |
| AR | 662 | 59 | 9 | 1 |
| C | 420 | 56 | 13 | 4 |
| AO | 74 | 52 | 70 | 1 |
| W | 64 | 49 | 77 | 1 |
| L | 57 | 43 | 75 | 1 |
| AB | 98 | 36 | 37 | 5 |
| I | 118 | 35 | 30 | 10 |
| AM | 32 | 30 | 94 | 1 |
| M | 41 | 28 | 68 | 2 |
| K | 39 | 26 | 67 | 2 |
| AK | 65 | 25 | 38 | 3 |
| AQ | 57 | 17 | 30 | 7 |
| D | 118 | 16 | 14 | 1 |
| B | 125 | 15 | 12 | 2 |
| AA | 48 | 15 | 31 | 1 |
| AF | 28 | 15 | 54 | 1 |
| G | 103 | 7 | 7 | 42 |
| H | 39 | 7 | 18 | 39 |
| AE | 17 | 7 | 41 | 4 |
| AS | 25 | 7 | 28 | 1 |
| AV | 15 | 6 | 40 | 11 |
| N | 32 | 5 | 16 | 5 |
| Y | 13 | 5 | 38 | 1 |
| AI | 28 | 5 | 18 | 19 |
| AD | 6 | 3 | 50 | 72 |
| V | 50 | 2 | 4 | 192 |
| AJ | 6 | 2 | 33 | 88 |
| T | 1 | 1 | 100 | 38 |
| Z | 1 | 1 | 100 | 368 |
| AG | 1 | 1 | 100 | 146 |
| AH | 11 | 1 | 9 | 32 |
| AU | 16 | 1 | 6 | 249 |
| AW | 79 | 1 | 1 | 573 |
| O | 8 | 0 | 0 | 6758 |
| P | 3 | 0 | 0 | 12 637 |
| R | 2 | 0 | 0 | 31 266 |
| U | 4 | 0 | 0 | 1805 |
| AC | 6 | 0 | 0 | 2012 |
| AL | 2 | 0 | 0 | 26 430 |
| AP | 5 | 0 | 0 | 1156 |
| AT | 3 | 0 | 0 | 36 085 |
For each assay, the number of active compounds in matrix 1, their recall in the top 1% of the ranking, and the highest-ranked active for RF models trained with matrix 2 are reported.
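The three derived columns of this table (actives in the top 1%, recall, and rank of the first active) follow from a single ranked list per assay. The sketch below shows one way to compute them. The function name `top_fraction_stats`, the rounding of the cut-off (floor, with a minimum of one compound), and the toy ranking are assumptions for illustration, not details taken from the study.

```python
import math
from typing import Optional, Sequence, Tuple


def top_fraction_stats(
    labels: Sequence[int], scores: Sequence[float], fraction: float = 0.01
) -> Tuple[int, float, Optional[int]]:
    """Return (hits in top fraction, recall in %, 1-based rank of first active)."""
    # Rank compounds by descending predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]

    # Assumption: cut-off is floor(N * fraction), at least one compound.
    n_top = max(1, math.floor(len(ranked) * fraction))
    n_active = sum(ranked)
    hits = sum(ranked[:n_top])

    recall = 100.0 * hits / n_active if n_active else 0.0
    first_rank = next((r for r, y in enumerate(ranked, start=1) if y == 1), None)
    return hits, recall, first_rank


# Toy ranking of 200 compounds with actives at ranks 2, 5, and 150.
toy_scores = [200 - i for i in range(200)]
toy_labels = [0] * 200
for rank in (2, 5, 150):
    toy_labels[rank - 1] = 1

hits, recall, first_rank = top_fraction_stats(toy_labels, toy_scores)
```

The recall column is simply the ratio of the two preceding columns: for assay J, 113 of 156 actives gives 72%, and for assay X, 383 of 1320 gives 29%. With 109 925 compounds in matrix 1, the top 1% corresponds to roughly 1100 ranked compounds.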