Mengzhou Bi, Zhen Guan, Tengjiao Fan, Na Zhang, Jianhua Wang, Guohui Sun, Lijiao Zhao, Rugang Zhong.
Abstract
Dual-specificity tyrosine phosphorylation-regulated kinase 1A (DYRK1A) has been regarded as a potential therapeutic target for neurodegenerative diseases, and considerable progress has been made in the discovery of DYRK1A inhibitors. Identification of pharmacophoric fragments provides valuable information for structure- and fragment-based design of potent and selective DYRK1A inhibitors. In this study, seven machine learning methods along with five molecular fingerprints were employed to develop qualitative classification models of DYRK1A inhibitors, which were evaluated by cross-validation, a test set, and an external validation set with four performance indicators: predictive classification accuracy (CA), the area under the receiver operating characteristic curve (AUC), Matthews correlation coefficient (MCC), and balanced accuracy (BA). The PubChem fingerprint-support vector machine model (CA = 0.909, AUC = 0.933, MCC = 0.717, BA = 0.855) and the PubChem fingerprint-artificial neural network model (CA = 0.862, AUC = 0.911, MCC = 0.705, BA = 0.870) were considered the optimal models for the training set and test set, respectively. A hybrid data balancing method, SMOTETL, which combines the synthetic minority over-sampling technique (SMOTE) and Tomek link (TL) algorithms, was applied to explore the impact of balanced learning on the performance of the models. Based on frequency analysis and information gain, pharmacophoric fragments related to DYRK1A inhibition were also identified. These results provide theoretical support and clues for the screening and design of novel DYRK1A inhibitors.
Keywords: DYRK1A; classification models; heterocyclic inhibitors; pharmacophoric fragments
Year: 2022 PMID: 35335117 PMCID: PMC8954712 DOI: 10.3390/molecules27061753
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. (A) Distributions of experimental pIC50 values for the dataset (117, grey bars), training set (88, green bars), and test set (29, blue bars). (B) Heat map of molecular similarity based on the Euclidean distance metric for the entire dataset. (C) Chemical space of the training set (blue dots) and test set (red dots) using the top three principal components of Dragon molecular descriptors (51% variance explained). (D) Radar map of the dataset with the parameters of Lipinski’s rule of five and the number of rotatable bonds.
Figure 2. Performance of the 35 models on the training set with five-fold cross-validation. (A) AUC-CA histogram; (B) SE-SP histogram.
Performance of top 10 models for the training set, test set, and external validation set.
| Data Set | Model | AUC | CA | MCC | TP * | TN * | FP * | FN * | SE | SP | BA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Training set | PubChemFP-SVM | 0.933 | 0.909 | 0.717 | 67 | 13 | 6 | 2 | 0.971 | 0.684 | 0.828 |
| | SubFP-LR | 0.914 | 0.864 | 0.583 | 64 | 12 | 7 | 5 | 0.928 | 0.632 | 0.780 |
| | PubChemFP-NB | 0.908 | 0.807 | 0.508 | 57 | 14 | 5 | 12 | 0.826 | 0.737 | 0.782 |
| | PubChemFP-RF | 0.908 | 0.920 | 0.753 | 68 | 13 | 6 | 1 | 0.986 | 0.684 | 0.835 |
| | SubFP-ANN | 0.908 | 0.841 | 0.530 | 62 | 12 | 7 | 7 | 0.899 | 0.632 | 0.766 |
| | PubChemFP-LR | 0.904 | 0.920 | 0.755 | 67 | 14 | 5 | 2 | 0.971 | 0.737 | 0.854 |
| | MACCSFP-RF | 0.900 | 0.898 | 0.678 | 67 | 12 | 7 | 2 | 0.971 | 0.632 | 0.802 |
| | SubFP-Tree | 0.896 | 0.875 | 0.638 | 63 | 14 | 5 | 6 | 0.913 | 0.737 | 0.825 |
| | EStateFP-ANN | 0.893 | 0.852 | 0.556 | 63 | 12 | 7 | 6 | 0.913 | 0.632 | 0.773 |
| | PubChemFP-ANN | 0.893 | 0.909 | 0.743 | 64 | 16 | 3 | 5 | 0.928 | 0.842 | 0.885 |
| Test set | PubChemFP-SVM | 0.911 | 0.862 | 0.705 | 17 | 8 | 1 | 3 | 0.850 | 0.889 | 0.870 |
| | SubFP-LR | 0.903 | 0.793 | 0.493 | 18 | 5 | 4 | 2 | 0.900 | 0.556 | 0.728 |
| | PubChemFP-NB | 0.881 | 0.828 | 0.647 | 16 | 8 | 1 | 4 | 0.800 | 0.889 | 0.845 |
| | PubChemFP-RF | 0.917 | 0.897 | 0.761 | 20 | 6 | 3 | 0 | 1.000 | 0.667 | 0.834 |
| | SubFP-ANN | 0.881 | 0.793 | 0.517 | 17 | 6 | 3 | 3 | 0.850 | 0.667 | 0.759 |
| | PubChemFP-LR | 0.944 | 0.862 | 0.705 | 17 | 8 | 1 | 3 | 0.850 | 0.889 | 0.870 |
| | MACCSFP-RF | 0.922 | 0.862 | 0.680 | 20 | 5 | 4 | 0 | 1.000 | 0.556 | 0.778 |
| | SubFP-Tree | 0.825 | 0.862 | 0.517 | 17 | 6 | 3 | 3 | 0.850 | 0.667 | 0.759 |
| | EStateFP-ANN | 0.858 | 0.793 | 0.517 | 17 | 6 | 3 | 3 | 0.850 | 0.667 | 0.759 |
| | PubChemFP-ANN | 0.911 | 0.862 | 0.705 | 17 | 8 | 1 | 3 | 0.850 | 0.889 | 0.870 |
| Validation set | PubChemFP-SVM | 0.660 | 0.667 | 0.213 | 8 | 2 | 3 | 2 | 0.800 | 0.400 | 0.600 |
| | SubFP-LR | 0.780 | 0.667 | 0.139 | 9 | 1 | 4 | 1 | 0.900 | 0.200 | 0.550 |
| | PubChemFP-NB | 0.660 | 0.667 | 0.378 | 6 | 4 | 1 | 4 | 0.600 | 0.400 | 0.500 |
| | PubChemFP-RF | 0.430 | 0.600 | −0.189 | 9 | 0 | 5 | 1 | 0.600 | 0.000 | 0.300 |
| | SubFP-ANN | 0.760 | 0.733 | 0.354 | 9 | 2 | 3 | 1 | 0.900 | 0.400 | 0.650 |
| | PubChemFP-LR | 0.760 | 0.667 | 0.139 | 9 | 1 | 4 | 1 | 0.600 | 0.100 | 0.350 |
| | MACCSFP-RF | 0.820 | 0.667 | - | 10 | 0 | 5 | 0 | 1.000 | 0.000 | 0.500 |
| | SubFP-Tree | 0.600 | 0.667 | 0.213 | 8 | 2 | 3 | 2 | 0.800 | 0.400 | 0.600 |
| | EStateFP-ANN | 0.820 | 0.667 | 0.213 | 8 | 2 | 3 | 2 | 0.800 | 0.400 | 0.600 |
| | PubChemFP-ANN | 0.660 | 0.733 | 0.354 | 9 | 2 | 3 | 1 | 0.900 | 0.400 | 0.650 |
* True positive (TP), true negative (TN), false positive (FP), and false negative (FN).
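The SE, SP, CA, BA, and MCC columns can be recomputed from the four confusion-matrix counts. The following is a minimal sketch using the standard textbook definitions of these metrics; the function name `classification_metrics` is illustrative and is not taken from the paper.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Derive SE, SP, CA, BA and MCC from confusion-matrix counts."""
    se = tp / (tp + fn)                      # sensitivity: recall on potent inhibitors
    sp = tn / (tn + fp)                      # specificity: recall on non-potent inhibitors
    ca = (tp + tn) / (tp + tn + fp + fn)     # classification accuracy
    ba = (se + sp) / 2                       # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal total is zero
    mcc = (tp * tn - fp * fn) / denom if denom else float("nan")
    return {"SE": se, "SP": sp, "CA": ca, "BA": ba, "MCC": mcc}


# Training-set counts of the PubChemFP-SVM model: TP=67, TN=13, FP=6, FN=2
metrics = classification_metrics(67, 13, 6, 2)
print({name: round(value, 3) for name, value in metrics.items()})
# {'SE': 0.971, 'SP': 0.684, 'CA': 0.909, 'BA': 0.828, 'MCC': 0.717}
```

When a model predicts every compound as potent (TN = 0 and FN = 0, as for MACCSFP-RF on the validation set), the MCC denominator is zero, which is consistent with the "-" entry in that row.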
Performance of top 10 balanced models for the training set, test set, and external validation set.
| Data Set | Model | AUC | CA | MCC | TP | TN | FP | FN | SE | SP | BA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Training set | PubChemFP-LR | 0.993 | 0.948 | 0.896 | 63 | 64 | 3 | 4 | 0.940 | 0.955 | 0.948 |
| | PubChemFP-SVM | 0.990 | 0.940 | 0.881 | 64 | 62 | 5 | 3 | 0.955 | 0.925 | 0.940 |
| | PubChemFP-ANN | 0.989 | 0.955 | 0.910 | 64 | 64 | 3 | 3 | 0.955 | 0.955 | 0.955 |
| | MACCSFP-RF | 0.984 | 0.954 | 0.908 | 62 | 62 | 3 | 3 | 0.954 | 0.954 | 0.954 |
| | PubChemFP-kNN | 0.983 | 0.948 | 0.896 | 62 | 65 | 2 | 5 | 0.925 | 0.970 | 0.948 |
| | PubChemFP-RF | 0.979 | 0.948 | 0.896 | 65 | 62 | 5 | 2 | 0.970 | 0.925 | 0.948 |
| | MACCSFP-LR | 0.974 | 0.954 | 0.908 | 62 | 62 | 3 | 3 | 0.954 | 0.954 | 0.954 |
| | MACCSFP-kNN | 0.972 | 0.954 | 0.908 | 61 | 63 | 2 | 4 | 0.938 | 0.969 | 0.954 |
| | SubFP-RF | 0.971 | 0.888 | 0.777 | 61 | 58 | 9 | 6 | 0.910 | 0.866 | 0.888 |
| | SubFP-LR | 0.971 | 0.881 | 0.761 | 59 | 59 | 8 | 8 | 0.881 | 0.881 | 0.881 |
| Test set | PubChemFP-LR | 0.808 | 0.863 | 0.577 | 19 | 5 | 4 | 1 | 0.950 | 0.556 | 0.753 |
| | PubChemFP-SVM | 0.883 | 0.828 | 0.680 | 20 | 5 | 4 | 0 | 1.000 | 0.556 | 0.778 |
| | PubChemFP-ANN | 0.836 | 0.828 | 0.493 | 18 | 5 | 4 | 2 | 0.900 | 0.556 | 0.728 |
| | MACCSFP-RF | 0.933 | 0.862 | 0.667 | 19 | 6 | 3 | 1 | 0.950 | 0.667 | 0.808 |
| | PubChemFP-kNN | 0.881 | 0.862 | 0.680 | 20 | 5 | 4 | 0 | 1.000 | 0.556 | 0.778 |
| | PubChemFP-RF | 0.895 | 0.862 | 0.697 | 20 | 7 | 3 | 1 | 0.952 | 0.700 | 0.826 |
| | MACCSFP-LR | 0.922 | 0.862 | 0.667 | 19 | 6 | 3 | 1 | 0.950 | 0.667 | 0.808 |
| | MACCSFP-kNN | 0.811 | 0.862 | 0.680 | 20 | 5 | 4 | 0 | 1.000 | 0.556 | 0.778 |
| | SubFP-RF | 0.872 | 0.862 | 0.680 | 20 | 5 | 4 | 0 | 1.000 | 0.556 | 0.778 |
| | SubFP-LR | 0.806 | 0.759 | 0.393 | 18 | 4 | 5 | 2 | 0.900 | 0.444 | 0.672 |
| Validation set | PubChemFP-LR | 0.500 | 0.600 | 0.100 | 7 | 2 | 3 | 3 | 0.700 | 0.400 | 0.550 |
| | PubChemFP-SVM | 0.400 | 0.600 | 0.100 | 7 | 2 | 3 | 3 | 0.700 | 0.400 | 0.550 |
| | PubChemFP-ANN | 0.560 | 0.533 | 0.000 | 6 | 2 | 3 | 4 | 0.600 | 0.400 | 0.500 |
| | MACCSFP-RF | 0.640 | 0.667 | 0.139 | 9 | 1 | 4 | 1 | 0.900 | 0.200 | 0.550 |
| | PubChemFP-kNN | 0.330 | 0.600 | −0.189 | 9 | 0 | 5 | 1 | 0.900 | 0.000 | 0.450 |
| | PubChemFP-RF | 0.300 | 0.600 | −0.189 | 9 | 0 | 5 | 1 | 0.900 | 0.000 | 0.450 |
| | MACCSFP-LR | 0.760 | 0.733 | 0.378 | 10 | 1 | 4 | 0 | 1.000 | 0.200 | 0.600 |
| | MACCSFP-kNN | 0.590 | 0.533 | −0.277 | 8 | 0 | 5 | 2 | 0.800 | 0.000 | 0.400 |
| | SubFP-RF | 0.760 | 0.773 | 0.378 | 10 | 1 | 4 | 0 | 1.000 | 0.200 | 0.600 |
| | SubFP-LR | 0.520 | 0.600 | −0.189 | 9 | 0 | 5 | 1 | 0.900 | 0.000 | 0.450 |
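The balanced training rows above contain roughly equal numbers of potent and non-potent compounds (for example, 63 + 4 = 67 and 64 + 3 = 67 in the PubChemFP-LR row), which is what SMOTE followed by Tomek-link cleaning is expected to produce. The sketch below illustrates the two steps on toy 2-D points in plain Python. It is not the authors' implementation: the function names, the value of `k`, the seed, and the choice to drop both members of each Tomek link are assumptions made for illustration.

```python
import math
import random


def smote(X, y, minority, k=3, seed=0):
    """Add synthetic minority points until the two classes are equal in size.

    Assumes a binary problem with at least two minority samples.
    """
    rng = random.Random(seed)
    pts = [x for x, label in zip(X, y) if label == minority]
    n_new = (len(X) - len(pts)) - len(pts)   # majority size minus minority size
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        base = rng.choice(pts)
        # up to k nearest minority neighbours of the base point
        neighbours = sorted((p for p in pts if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()                   # position along the base-mate segment
        X_out.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
        y_out.append(minority)
    return X_out, y_out


def remove_tomek_links(X, y):
    """Drop both points of every Tomek link (mutual nearest neighbours of opposite class)."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    drop = {i for i, j in enumerate(nn) if nn[j] == i and y[i] != y[j]}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: six majority points (label 1) and three minority points (label 0)
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (2, 2), (5, 5), (5, 6), (6, 5)]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0]
X_bal, y_bal = smote(X, y, minority=0)
X_res, y_res = remove_tomek_links(X_bal, y_bal)
```

Because each Tomek link pairs one point from each class, removing both members keeps the class counts equal after cleaning. For real fingerprint data, a maintained implementation such as `SMOTETomek` in the imbalanced-learn package would normally be preferable to this hand-written version.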
PubChem fingerprint-based privileged substructures responsible for DYRK1A inhibition.
| Fingerprints | Substructure | General Substructure | Representative Substructure | IG | FP | FN |
|---|---|---|---|---|---|---|
| PubchemFP187 | ≥2 saturated or aromatic nitrogen-containing ring size 6 | | | 0.088 | 1.315 (23) | 0 (0) |
| PubchemFP188 | ≥2 saturated or aromatic heteroatom-containing ring size 6 | | | 0.088 | 1.315 (23) | 0 (0) |
| PubchemFP260 | ≥3 hetero-aromatic rings | | | 0.067 | 1.292 (18) | 0 (0) |
| PubchemFP646 | O=C–N–C–[#1] | | | 0.063 | 1.315 (17) | 0 (0) |
| PubchemFP645 | O=C–N–C–C | | | 0.053 | 1.230 (29) | 0.270 (2) |
| PubchemFP499 | N–C:C:N | | | 0.064 | 1.237 (32) | 0.246 (2) |
| PubchemFP547 | N–C:C-N | | | 0.064 | 1.237 (32) | 0.246 (2) |
| PubchemFP569 | N–C–C–N | | | 0.060 | 1.213 (36) | 0.321 (3) |
| PubchemFP611 | N–C–C–N–C | | | 0.060 | 1.213 (36) | 0.321 (3) |
| PubchemFP629 | S-C:C:C-N | | | 0.062 | 1.198 (41) | 0.371 (4) |
| PubchemFP658 | C–C–S–C–C | | | 0.062 | 1.198 (41) | 0.371 (4) |
| PubchemFP691 | O–C–C–C–C–C–N | | | 0.144 | 1.263 (49) | 0.164 (2) |
| PubchemFP702 | O–C–C–C–C–C–N–C | | | 0.144 | 1.263 (49) | 0.164 (2) |
| PubchemFP703 | O–C–C–C–C–C(N)–C | | | 0.139 | 1.262 (48) | 0.167 (2) |
| PubchemFP720 | Oc1ccc(S)cc1 | | | 0.103 | 1.253 (41) | 0.194 (2) |
| PubchemFP783 | OC1CCC(S)CC1 | | | 0.103 | 1.253 (41) | 0.194 (2) |
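The IG and frequency columns can be related to simple counts. The sketch below assumes (this excerpt does not state the formulas) that IG is the Shannon information gain in bits of a fingerprint bit with respect to the potent/non-potent label, that FP and FN are the fraction of fragment-containing compounds in a class divided by that class's share of the dataset, and that the bracketed numbers are compound counts. It also assumes the analysis covers the 117 training and test compounds (89 potent, 28 non-potent). Under these assumptions the tabulated values for PubchemFP187 are reproduced; all function names are illustrative.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a class-count vector."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(present, absent):
    """IG of a binary fingerprint bit.

    present / absent: (potent, non_potent) counts of compounds with / without the bit.
    """
    n = sum(present) + sum(absent)
    overall = entropy([present[0] + absent[0], present[1] + absent[1]])
    conditional = sum(sum(group) / n * entropy(group)
                      for group in (present, absent) if sum(group))
    return overall - conditional


def relative_frequency(n_frag_class, n_frag_total, n_class, n_total):
    """Class fraction among fragment-containing compounds over the class's dataset share."""
    return (n_frag_class / n_frag_total) / (n_class / n_total)


# Assumed dataset: training + test sets, 89 potent and 28 non-potent compounds
N_POTENT, N_NON_POTENT = 89, 28
N_TOTAL = N_POTENT + N_NON_POTENT

# PubchemFP187 is present in 23 potent and 0 non-potent compounds
ig_187 = information_gain(present=(23, 0), absent=(N_POTENT - 23, N_NON_POTENT))
freq_potent_187 = relative_frequency(23, 23, N_POTENT, N_TOTAL)
print(round(ig_187, 3), round(freq_potent_187, 3))  # 0.088 1.315
```

A frequency above 1 means the fragment is enriched in that class relative to the class's overall share, so a large potent-class value combined with a non-potent value near 0 marks a fragment that occurs almost exclusively in potent inhibitors.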
Figure 3. Proposed binding modes of (A) compound 73 and (B) compound 23 with DYRK1A.
Figure 4. Pharmacophoric fragments presented in the DYRK1A inhibitors (b27 and 8 h) and theoretical hits.
Figure 5. Predicted binding modes of five theoretical hits, (A) CP1, (B) CP4, (C) CP11, (D) CP12, and (E) CP23, with DYRK1A.
The statistics of molecules in the datasets.
| Data Set | Potent Inhibitors | Non-Potent Inhibitors | Total |
|---|---|---|---|
| Training set | 69 | 19 | 88 |
| Test set | 20 | 9 | 29 |
| Validation set | 10 | 5 | 15 |
| Total | 99 | 33 | 132 |