Jihyeun Lee1, Surendra Kumar1, Sang-Yoon Lee2, Sung Jean Park1, Mi-Hyun Kim1.
Abstract
S100A9 is a potential therapeutic target for various diseases, including prostate cancer, colorectal cancer, and Alzheimer's disease. However, the sparsity of atomic-level data, such as the protein-protein interactions of S100A9 with RAGE, TLR4/MD2, or CD147 (EMMPRIN), hinders the rational drug design of S100A9 inhibitors. Herein, we report for the first time predictive models of the S100A9 inhibitory effect, built by applying machine learning classifiers to 2D molecular descriptors. The models were optimized over both feature selectors and classifiers to produce the top eight random forest models with robust predictability and high cost-effectiveness. Notably, optimal feature sets were obtained by reducing 2,798 features to dozens of features, with fingerprints chopped into individual bits. Moreover, the high efficiency of the compact feature sets allowed us to screen a large-scale dataset (over 6,000,000 compounds) within a week. Through a consensus vote of the top models, 46 hits (hit rate = 0.000713%) were identified as potential S100A9 inhibitors. We expect that our models will facilitate the drug discovery process by providing high predictive power and cost reduction, and will give insights into designing novel drugs targeting S100A9.
Keywords: Alzheimer's disease; S100; classification; consensus vote; feature selection; ligand-based virtual screening; machine learning; random forest
Year: 2019 PMID: 31824919 PMCID: PMC6886474 DOI: 10.3389/fchem.2019.00779
Source DB: PubMed Journal: Front Chem ISSN: 2296-2646 Impact factor: 5.221
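The consensus vote mentioned in the abstract can be illustrated with a minimal sketch. The exact voting rule is not stated in this record, so the sketch assumes that a compound counts as a hit when every model (or at least `min_votes` models) predicts an activity probability at or above a cutoff of 0.5. The function name, the cutoff, the third model label, and all probabilities are hypothetical.

```python
from typing import Dict, List, Optional


def consensus_hits(probabilities: Dict[str, List[float]],
                   threshold: float = 0.5,
                   min_votes: Optional[int] = None) -> List[int]:
    """Return indices of compounds that enough models call active.

    probabilities maps a model label to one predicted probability per
    compound. With min_votes=None the vote must be unanimous (assumption).
    """
    models = list(probabilities)
    if not models:
        return []
    n_compounds = len(probabilities[models[0]])
    if any(len(probabilities[m]) != n_compounds for m in models):
        raise ValueError("every model must score the same compounds")
    required = len(models) if min_votes is None else min_votes
    hits = []
    for i in range(n_compounds):
        # Count the models whose probability reaches the cutoff.
        votes = sum(probabilities[m][i] >= threshold for m in models)
        if votes >= required:
            hits.append(i)
    return hits


# Made-up probabilities for three compounds and three models.
example = {
    "SSFS_3": [0.91, 0.40, 0.75],
    "BF_5": [0.88, 0.95, 0.61],
    "GS_3": [0.79, 0.30, 0.45],
}
unanimous = consensus_hits(example)            # all three models agree
majority = consensus_hits(example, min_votes=2)  # at least two models agree
```

A unanimous vote is the strictest choice and keeps the hit list short; relaxing `min_votes` trades precision for recall.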
Figure 1. Workflow depicting the process of the top classification model development.
Figure 2. Three-dimensional principal component analysis (PCA) of hits and patent molecules. Patent 1, Patent 2, and Patent 3 refer to WO2015177367A1, WO2014184234A1, and WO2016042172A1, respectively.
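The number of features.
| Feature type | Generated initially | After filtration (step 1) | After filtration (step 2) |
| --- | --- | --- | --- |
| 1D/2D descriptors | 1,444 | 1,218 | 1,017 |
| Fingerprints | 1,354 | 855 | 855 |
| MACCSFP | 166 | 147 | 147 |
| PubChemFP | 881 | 598 | 598 |
| SubstructureFP | 307 | 110 | 110 |
| Total | 2,798 | 2,073 | 1,872 |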
The numbers of descriptors and bits of fingerprints generated initially, and the selected numbers of features after the removal of unnecessary features by certain filtration methods are listed. Note that each bit of a fingerprint was considered as a single feature.
The optimized parameters of the random forest models and the AUC of ROC values for test set prediction.
| FS method | Dataset | Optimized parameter 1 | Optimized parameter 2 | AUC of ROC (test set) |
| --- | --- | --- | --- | --- |
| BF | SET01 | 5 | 52 | 0.971 |
| BF | SET02 | 6 | 42 | 0.961 |
| BF | SET03 | 10 | 215 | 0.956 |
| BF | SET04 | 10 | 203 | 0.912 |
| BF | SET05 | 2 | 15 | 1 |
| GS | SET01 | 3 | 82 | 0.932 |
| GS | SET02 | 10 | 164 | 0.935 |
| GS | SET03 | 10 | 112 | 0.948 |
| GS | SET04 | 6 | 105 | 0.867 |
| GS | SET05 | 3 | 36 | 1 |
| PSOS | SET01 | 8 | 76 | 0.952 |
| PSOS | SET02 | 5 | 45 | 0.915 |
| PSOS | SET03 | 9 | 185 | 0.952 |
| PSOS | SET04 | 10 | 84 | 0.882 |
| PSOS | SET05 | 4 | 13 | 1 |
| SSFS | SET01 | 7 | 97 | 0.967 |
| SSFS | SET02 | 6 | 59 | 0.966 |
| SSFS | SET03 | 9 | 236 | 0.963 |
| SSFS | SET04 | 8 | 245 | 0.896 |
| SSFS | SET05 | 2 | 50 | 1 |
| None | SET01 | 5 | 25 | 0.954 |
| None | SET02 | 7 | 96 | 0.941 |
| None | SET03 | 7 | 124 | 0.949 |
| None | SET04 | 7 | 60 | 0.872 |
| None | SET05 | 5 | 79 | 1 |
The number of molecules from the eMolecules database in each subset.
| Subsets | Molecules per subset |
| --- | --- |
| Subset01–subset09 | 100,000 |
| Subset10–subset18 | 200,000 |
| Subset19–subset32 | 250,000 |
| Subset33 | 247,184 |
| Total | 6,447,184 |
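The total in the table and the hit rate quoted in the abstract can be checked with a few lines of arithmetic. The sketch below assumes that each listed count applies to every subset in its range, which is the reading under which the rows sum to the stated total; the variable names are illustrative.

```python
# (number of subsets in the range, molecules per subset)
subset_groups = [
    (9, 100_000),   # Subset01 to subset09
    (9, 200_000),   # Subset10 to subset18
    (14, 250_000),  # Subset19 to subset32
    (1, 247_184),   # Subset33
]

total_molecules = sum(count * size for count, size in subset_groups)

n_hits = 46  # consensus hits reported in the abstract
hit_rate_percent = 100 * n_hits / total_molecules

print(total_molecules)              # 6447184
print(f"{hit_rate_percent:.6f}%")   # 0.000713%
```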
The number of selected features after each FS method.
| SET05 | 51 | 852 | 591 | 47 |
| SET01 | 37 | 940 | 552 | 29 |
| SET02 | 50 | 751 | 602 | 23 |
| SET03 | 66 | 741 | 600 | 24 |
| SET04 | 70 | 667 | 610 | 28 |
Figure 3. The rates of feature reduction. The reduction rate is the ratio of the number of features removed by the FS method to the number of features before the FS method was applied.
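The reduction rate defined in the Figure 3 caption is a simple ratio. A minimal sketch follows; the function name and the example counts are hypothetical and are not taken from the tables.

```python
def reduction_rate(n_before: int, n_after: int) -> float:
    """Fraction of features removed by a feature selection (FS) method.

    n_before: number of features before the FS method
    n_after:  number of features kept after the FS method
    """
    if n_before <= 0:
        raise ValueError("n_before must be positive")
    if not 0 <= n_after <= n_before:
        raise ValueError("n_after must lie between 0 and n_before")
    return (n_before - n_after) / n_before


# Hypothetical example: 1,000 features reduced to 50.
rate = reduction_rate(1000, 50)
```

A rate close to 1 means the selector kept only a small fraction of the input features.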
Figure 4. The composition of each feature set. The numbers of descriptors and fingerprint bits of each kind after each FS method are shown. SET0N refers to the IC50 threshold used (SET01: 4 μM; SET02: 3 μM; SET03: 2 μM; SET04: 1 μM; SET05: 11.4 μM). Note that the maximum value of the horizontal axis differs among the FS methods.
Figure 5. The merits of the feature sets obtained by each feature selection method for each dataset with a different IC50 threshold.
Figure 6. Heat map depicting the AUC of the ROC curve of the classification models.
Figure 7. Heat map depicting the Matthews correlation coefficient (MCC) of the classification models.
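For reference, the MCC shown in Figure 7 is computed from the four cells of a binary confusion matrix. The sketch below uses the standard definition; the function name and the confusion counts are hypothetical, and returning 0 for an undefined denominator is a common convention rather than something stated in this record.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from true/false positive and negative counts.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Undefined when any marginal sum is zero; 0 is the usual convention.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical test-set confusion matrix.
mcc = matthews_corrcoef(tp=45, tn=40, fp=10, fn=5)
```

Unlike accuracy, the MCC stays informative when the active and inactive classes are imbalanced, which is why it is often reported alongside the AUC.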
Figure 8. Prediction probabilities of the top 8 models for patent molecules, decoy molecules, and hits. Each label on the horizontal axis represents one of the top random forest models. The FS method and the dataset (SET0N) used in a model are indicated as “FS method_N”. For example, SSFS_3 refers to the random forest model built with the feature set chosen by SSFS on dataset SET03. *Note that all patent molecules were considered active in BF_5 and SSFS_5.
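Drug-likeness and ADME parameters predicted for the 46 hits using QikProp, and the Tanimoto similarity of each hit to its nearest neighbor.
| Entry | Molecular weight | Octanol/water partition coefficient | HB donors | HB acceptors | N and O atoms | Polar surface area | Lipinski violations | Jorgensen violations | Caco-2 permeability (nm/s) | MDCK permeability (nm/s) | Tanimoto coefficient |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |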
| 1 | 438.81 | 4.64 | 2 | 5.25 | 6 | 83.28 | 0 | 1 | 883.32 | 6500.64 | 0.723 |
| 2 | 447.52 | 1.88 | 1 | 10.25 | 8 | 110.03 | 0 | 0 | 169.49 | 217.42 | 0.733 |
| 3 | 475.58 | 2.45 | 1 | 10.25 | 8 | 107.08 | 0 | 0 | 212.80 | 254.04 | 0.725 |
| 4 | 459.55 | 3.35 | 2 | 9 | 7 | 109.44 | 0 | 1 | 252.94 | 277.07 | 0.709 |
| 5 | 357.35 | 2.21 | 1 | 6.5 | 6 | 79.76 | 0 | 0 | 662.83 | 1251.4 | 0.803 |
| 6 | 475.54 | 2.06 | 2 | 12 | 10 | 152.60 | 0 | 0 | 74.61 | 50.07 | 0.686 |
| 7 | 369.37 | 2.66 | 2 | 5.5 | 7 | 112.07 | 0 | 0 | 123.62 | 120.63 | 0.823 |
| 8 | 489.56 | 2.35 | 2 | 12 | 10 | 152.60 | 0 | 1 | 74.59 | 50.05 | 0.685 |
| 9 | 463.57 | 2.06 | 3 | 9.5 | 9 | 138.29 | 0 | 0 | 31.69 | 26.88 | 0.763 |
| 10 | 399.25 | 2.08 | 1.25 | 7.75 | 7 | 102.73 | 0 | 0 | 310.52 | 651.41 | 0.831 |
| 11 | 394.81 | 2.23 | 1.25 | 7.75 | 7 | 102.66 | 0 | 0 | 338.55 | 628.45 | 0.757 |
| 12 | 376.82 | 1.93 | 1.25 | 7.75 | 7 | 103.25 | 0 | 0 | 371.75 | 355.25 | 0.767 |
| 13 | 394.81 | 2.10 | 1.25 | 7.75 | 7 | 103.88 | 0 | 0 | 336.23 | 499.24 | 0.757 |
| 14 | 449.54 | 2.32 | 3 | 9.5 | 9 | 134.36 | 0 | 0 | 78.22 | 41.92 | 0.711 |
| 15 | 463.57 | 2.57 | 2 | 9.75 | 8 | 110.80 | 0 | 0 | 245.98 | 297.80 | 0.708 |
| 16 | 396.39 | 3.56 | 1.25 | 5.25 | 6 | 93.24 | 0 | 1 | 414.29 | 1915.26 | 0.727 |
| 17 | 378.42 | 0.73 | 3 | 10 | 8 | 136.31 | 0 | 0 | 55.49 | 35.38 | 0.747 |
| 18 | 408.45 | 0.61 | 2 | 11.75 | 10 | 140.34 | 0 | 0 | 89.51 | 46.28 | 0.738 |
| 19 | 388.46 | 2.35 | 2 | 6.5 | 7 | 120.67 | 0 | 0 | 135.00 | 195.05 | 0.633 |
| 20 | 392.47 | 3.46 | 2 | 5.25 | 6 | 88.54 | 0 | 0 | 274.13 | 938.46 | 0.663 |
| 21 | 379.41 | 0.81 | 3 | 9.25 | 8 | 135.31 | 0 | 0 | 48.47 | 31.86 | 0.833 |
| 22 | 488.49 | 4.53 | 1 | 8.7 | 8 | 90.37 | 0 | 2 | 1243.59 | 3476.5 | 0.697 |
| 23 | 374.43 | 3.01 | 2.25 | 5.75 | 6 | 96.24 | 0 | 0 | 317.78 | 825.75 | 0.718 |
| 24 | 376.86 | 2.93 | 2.25 | 5.75 | 6 | 95.44 | 0 | 0 | 309.55 | 1326.69 | 0.721 |
| 25 | 376.86 | 2.92 | 2.25 | 5.75 | 6 | 95.45 | 0 | 0 | 344.43 | 1241.09 | 0.721 |
| 26 | 360.41 | 2.61 | 2.25 | 5.75 | 6 | 94.32 | 0 | 0 | 309.07 | 1006.01 | 0.721 |
| 27 | 424.44 | 3.73 | 2.25 | 5.75 | 6 | 98.39 | 0 | 0 | 240.41 | 1826.54 | 0.704 |
| 28 | 410.41 | 3.34 | 2.25 | 5.75 | 6 | 94.30 | 0 | 0 | 309.14 | 2445.21 | 0.706 |
| 29 | 394.81 | 1.97 | 1.25 | 7.75 | 7 | 101.41 | 0 | 0 | 336.23 | 484.57 | 0.757 |
| 30 | 397.52 | 2.51 | 2 | 8 | 6 | 90.69 | 0 | 0 | 823.95 | 875.59 | 0.753 |
| 31 | 378.46 | 2.15 | 2 | 7.7 | 7 | 110.06 | 0 | 0 | 286.99 | 254.97 | 0.776 |
| 32 | 382.82 | 2.67 | 1 | 6.75 | 8 | 111.06 | 0 | 0 | 226.09 | 236.12 | 0.783 |
| 33 | 427.46 | 0.74 | 1 | 11 | 10 | 129.95 | 0 | 0 | 146.33 | 120.44 | 0.747 |
| 34 | 410.41 | 2.18 | 2 | 9 | 7 | 115.79 | 0 | 0 | 223.42 | 411.72 | 0.744 |
| 35 | 410.41 | 2.15 | 2 | 9 | 7 | 116.40 | 0 | 0 | 182.76 | 361.39 | 0.744 |
| 36 | 357.79 | 1.93 | 2 | 7 | 6 | 92.65 | 0 | 0 | 276.14 | 510.66 | 0.828 |
| 37 | 357.79 | 1.93 | 2 | 7 | 6 | 93.28 | 0 | 0 | 259.64 | 489.39 | 0.828 |
| 38 | 357.79 | 1.86 | 2 | 7 | 6 | 93.79 | 0 | 0 | 240.46 | 428.65 | 0.828 |
| 39 | 371.81 | 2.37 | 1 | 7.5 | 6 | 81.06 | 0 | 0 | 544.41 | 1089.44 | 0.780 |
| 40 | 374.43 | 3.67 | 1.25 | 5.75 | 6 | 84.30 | 0 | 1 | 922.94 | 2328.24 | 0.750 |
| 41 | 370.79 | 1.97 | 2 | 6.5 | 7 | 111.17 | 0 | 0 | 122.26 | 206.45 | 0.759 |
| 42 | 399.87 | 3.14 | 1 | 7.5 | 6 | 82.38 | 0 | 0 | 625.21 | 1265.93 | 0.791 |
| 43 | 412.80 | 2.35 | 1.25 | 7.75 | 7 | 101.92 | 0 | 0 | 323.45 | 793.95 | 0.757 |
| 44 | 398.40 | 4.57 | 2 | 4.5 | 6 | 80.90 | 0 | 1 | 1275.28 | 3912.14 | 0.671 |
| 45 | 379.42 | 2.24 | 2 | 8.5 | 6 | 101.90 | 0 | 0 | 319.42 | 418.38 | 0.759 |
| 46 | 348.39 | 0.82 | 2 | 9.5 | 7 | 114.57 | 0 | 0 | 141.61 | 101.70 | 0.783 |
| Standard value | 130.0–725.0 | −2.0–6.5 | 0.0–6.0 | 2.0–20.0 | 2–15 | 7.0–200.0 | Maximum is 4 | Maximum is 3 | <25 poor, >500 great | <25 poor, >500 great | |
Molecular weight.
Octanol/water partition coefficient.
Number of HB donors.
Number of HB acceptors.
Number of N and O atoms.
Polar surface area.
Number of violations of Lipinski's rule of five.
Number of violations of Jorgensen's rule of three.
Apparent Caco-2 cell permeability (nm/s).
Apparent MDCK cell permeability (nm/s).
Tanimoto coefficient between the entry and its nearest neighbor among 266 active molecules from patents.
Standard values are the ranges covering 95% of known drugs, as reported by QikProp.
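The nearest-neighbor Tanimoto coefficient in the last column of the table can be sketched as follows. The fingerprint type used for the similarity calculation is not stated in this record, so the sketch assumes that each molecule is represented by the set of its on-bit indices; the function names and the toy fingerprints are hypothetical.

```python
from typing import List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention for two empty fingerprints
    return len(a & b) / union


def nearest_neighbor(query: Set[int],
                     references: List[Set[int]]) -> Tuple[int, float]:
    """Return (index, similarity) of the reference most similar to query."""
    if not references:
        raise ValueError("references must not be empty")
    best_index, best_similarity = 0, tanimoto(query, references[0])
    for i, reference in enumerate(references[1:], start=1):
        similarity = tanimoto(query, reference)
        if similarity > best_similarity:
            best_index, best_similarity = i, similarity
    return best_index, best_similarity


# Toy fingerprints: one hit compared against three reference actives.
hit = {1, 2, 3, 4}
actives = [{1, 2}, {1, 2, 3, 5}, {7, 8}]
index, similarity = nearest_neighbor(hit, actives)
```

In the table, the reference set would be the 266 active patent molecules, and the reported value is the similarity to the closest of them.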