Tiago Dias1,2, Susana P Gaudêncio3,4, Florbela Pereira5.
Abstract
The risk of methicillin-resistant Staphylococcus aureus (MRSA) infection is increasing in both developed and developing countries, and new approaches to overcome this problem are needed. A ligand-based strategy to discover new inhibiting agents against MRSA infection was built by exploring machine learning techniques. The strategy is based on two quantitative structure–activity relationship (QSAR) studies, one using molecular descriptors (approach A) and the other using NMR descriptors (approach B). In approach A, regression models were developed using a total of 6645 molecules extracted from the ChEMBL, PubChem and ZINC databases and from recent literature. The performance of the regression models was evaluated by internal and external validation; the best model achieved an R² of 0.68 and an RMSE of 0.59 for the test set. In general, natural product (NP) drug discovery is a time-consuming process, and several dereplication strategies have been developed to overcome this inherent limitation. In approach B, we developed a new NP drug discovery methodology that consists of frontloading samples with 1D NMR descriptors to predict compounds with antibacterial activity prior to bioactivity screening. The NMR QSAR classification models were built using 1D NMR data (¹H and ¹³C) as descriptors, from crude extracts, fractions and pure compounds obtained from actinobacteria isolated from marine sediments collected off the Madeira Archipelago. The overall predictive accuracy of the best model exceeded 77% for both training and test sets.
Keywords: NMR descriptors; antibacterial activity; drug discovery; machine learning (ML) techniques; marine natural products (MNPs); marine-derived actinobacteria; methicillin-resistant Staphylococcus aureus (MRSA); molecular descriptors; quantitative structure–activity relationship (QSAR)
Year: 2018 PMID: 30597893 PMCID: PMC6356832 DOI: 10.3390/md17010016
Source DB: PubMed Journal: Mar Drugs ISSN: 1660-3397 Impact factor: 5.118
Structural clusters and pMIC values for anti-MRSA within the five clusters.
| Cluster 1 | Training Set 2 | Test Set 2 | Average/Maximum pMIC 3 |
|---|---|---|---|
| | 1123 | 324 | 4.72/7.80 |
| | 1657 | 517 | 4.59/7.81 |
| | 879 | 253 | 5.09/11.82 |
| | 913 | 270 | 5.12/9.00 |
| | 540 | 169 | 5.18/8.68 |
1 Cluster number and chemical structure of the cluster centroid. 2 Number of molecules. 3 Within the cluster for the training set.
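The pMIC values above are on a negative logarithmic scale. As a minimal sketch, assuming the common convention pMIC = −log10(MIC in mol/L), which this record does not state explicitly, a MIC reported in μg/mL could be converted as follows. The function name and the molecular-weight argument are hypothetical and serve only as an illustration.

```python
import math


def pmic_from_mic(mic_ug_per_ml: float, mol_weight_g_per_mol: float) -> float:
    """Convert a MIC in ug/mL to pMIC.

    Assumption (not stated in the source): pMIC = -log10(MIC in mol/L).
    """
    if mic_ug_per_ml <= 0 or mol_weight_g_per_mol <= 0:
        raise ValueError("MIC and molecular weight must be positive")
    # ug/mL is numerically equal to mg/L, so dividing by 1000 gives g/L.
    mic_g_per_l = mic_ug_per_ml / 1000.0
    # Dividing by the molecular weight (g/mol) gives mol/L.
    mic_mol_per_l = mic_g_per_l / mol_weight_g_per_mol
    return -math.log10(mic_mol_per_l)
```

Under this convention, a higher pMIC corresponds to a lower MIC, that is, to a more potent compound.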
Actinobacteria genera and correspondent anti-MRSA MIC values.
| Actinobacteria Genera | Set (Number/Sample Types) | Activity Class/Average MIC 1 |
|---|---|---|
| | Tr set 2 (2, cr 4) | cr: inactive/>250 |
| | Te set 3 (1, cr 4) | cr: inactive/>250 |
| | - | - |
| | Te set 3 (1, cr 4) | cr: inactive/>250 |
| | Tr set 2 (23, 3 cr 4, 13 fr 5, 7 pu 6) | cr: inactive/>250 |
| | Te set 3 (9, 3 cr 4, 3 fr 5, 3 pu 6) | cr: 2 active/35; 1 inactive/>250 |
| | Tr set 2 (29, 11 cr 4, 10 fr 5, 8 pu 6) | cr: 1 active/63; 10 inactive/>250 |
| | Te set 3 (10, 4 cr 4, 4 fr 5, 2 pu 6) | cr: inactive/>250 |
| | Tr set 2 (62, 21 cr 4, 18 fr 5, 23 pu 6) | cr: 2 active/5; 19 inactive/243 |
| | Te set 3 (18, 4 cr 4, 7 fr 5, 7 pu 6) | cr: inactive/219 |
1 μg/mL. 2 Training set. 3 Test set. 4 Crude extracts. 5 Fractions. 6 Pure compounds.
Exploration of two collections of empirical descriptors for the quantitative structure–activity relationship (QSAR) Random Forests (RF) model of pMIC against MRSA for the training set with an OOB estimation.
| Descriptors (#) | R2 | RMSE 1 | MAE 2 | % Error ≥ 0.5/% Error < 0.5 3 |
|---|---|---|---|---|
| MACCS (166) 4 | | | | |
| Sub (307) 4 | 0.509 | 0.681 | 0.505 | 39/61 |
| SubC (307) 4 | | | | |
| PubChem (881) 4 | | | | |
| CDK (1024) 4 | 0.551 | 0.652 | 0.471 | 36/64 |
| CDK Ext (1024) 4 | | | | |
| 1D2D (218) 5 | 0.364 | 0.775 | 0.600 | 49/51 |
1 Root mean squared error. 2 Mean absolute error. 3 Percent of molecules predicted with absolute error above or below 0.5. 4 Fingerprints. 5 Molecular descriptors.
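The regression metrics reported in these tables can be sketched in a few lines. This is an illustrative sketch only: the function name is hypothetical, and R² is computed here as the coefficient of determination, which is an assumption because the record does not say which definition of R² was used.

```python
import math


def regression_report(y_true, y_pred, threshold=0.5):
    """Return R2, RMSE, MAE and the percent of predictions whose
    absolute error is at or above / below `threshold` (pMIC units)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    pct_ge = 100.0 * sum(1 for a in abs_errors if a >= threshold) / n
    return {
        # Assumed definition: coefficient of determination.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs_errors) / n,
        "pct_error_ge": pct_ge,
        "pct_error_lt": 100.0 - pct_ge,
    }
```

For example, `regression_report(experimental_pmic, predicted_pmic)` would return the five quantities that make up one table row, given two equally long lists of pMIC values.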
Exploration of 3D descriptors for building anti-MRSA RF models with the four best sets of fingerprints (FPs) (MACCS, SubC, PubChem, and CDK Ext) for the training set in an OOB estimation.
| Descriptors (#) | R2 | RMSE | MAE | % Error ≥ 0.5/% Error < 0.5 1 |
|---|---|---|---|---|
| MACCS_3D_CDK (232) 2 | 0.441 | 0.731 | 0.569 | 47/53 |
| MACCS_3D_RDF (550) 3 | 0.426 | 0.745 | 0.587 | 45/55 |
| SubC_3D_CDK (373) 2 | 0.466 | 0.715 | 0.556 | 47/53 |
| SubC_3D_RDF (691) 3 | 0.448 | 0.730 | 0.574 | 48/52 |
| PubChem_3D_CDK (947) 2 | 0.479 | 0.705 | 0.546 | 45/55 |
| PubChem_3D_RDF (1265) 3 | 0.465 | 0.719 | 0.564 | 48/52 |
| CDK_Ext_3D_CDK (1090) 2 | 0.537 | 0.667 | 0.507 | 41/59 |
| CDK_Ext_3D_RDF (1408) 3 | 0.507 | 0.690 | 0.533 | 44/56 |
1 Percent of molecules predicted with absolute error above or below 0.5. 2 Comprising in total 66 descriptors, 6 3D-BCUTS, 29 CPSA, 9 Gravitational index, 7 Moment of inertia, 2 Petitjean shape index, and 13 WHIM. 3 RDF Pair calculated in sets of charge pairs covering a distance of 12.8 Å, obtaining in total 384 descriptors.
RF Prediction of pMIC against MRSA with subsets of descriptors from models SubC, PubChem, and CDK Ext.
| Model/n.° Descriptors | R2 | RMSE | MAE | % Error ≥ 0.5/% Error < 0.5 1 |
|---|---|---|---|---|
| Training set 2 | | | | |
| SubC/75 3 | 0.556 | 0.647 | 0.471 | 36/64 |
| SubC/100 3 | | | | |
| SubC/150 3 | 0.562 | 0.643 | 0.467 | 36/64 |
| PubChem/75 3 | 0.528 | 0.667 | 0.496 | 39/61 |
| PubChem/100 3 | 0.544 | 0.656 | 0.484 | 38/62 |
| PubChem/150 3 | | | | |
| CDK Ext/75 3 | 0.545 | 0.655 | 0.481 | 37/63 |
| CDK Ext/100 3 | 0.566 | 0.641 | 0.467 | 36/64 |
| CDK Ext/150 3 | | | | |
| Test set | | | | |
| SubC/75 3 | 0.637 | 0.626 | 0.464 | 36/64 |
| SubC/100 3 | 0.645 | 0.620 | 0.459 | 36/64 |
| SubC/150 3 | 0.644 | 0.621 | 0.460 | 36/64 |
| PubChem/75 3 | 0.603 | 0.654 | 0.494 | 39/61 |
| PubChem/100 3 | 0.620 | 0.639 | 0.475 | 37/63 |
| PubChem/150 3 | 0.632 | 0.629 | 0.471 | 37/63 |
| CDK Ext/75 3 | 0.609 | 0.650 | 0.483 | 38/62 |
| CDK Ext/100 3 | 0.632 | 0.631 | 0.467 | 37/63 |
| CDK Ext/150 3 | 0.641 | 0.623 | 0.458 | 34/66 |
1 Percent of molecules predicted with absolute error above or below 0.5. 2 OOB estimation. 3 Using the mean decrease in accuracy measure of importance for the descriptors in the RF algorithm.
Performance of different machine learning (ML) algorithms by the five structural clusters for the training set. The models comprising all the molecules of the training set are highlighted in bold.
| ML | Cluster | R2 | RMSE | MAE |
|---|---|---|---|---|
| RF 1 | | | | |
| | A | 0.583 | 0.555 | 0.416 |
| | B | 0.549 | 0.622 | 0.458 |
| | C | 0.445 | 0.757 | 0.520 |
| | D | 0.594 | 0.641 | 0.467 |
| | E | 0.575 | 0.600 | 0.449 |
| SVM 2 | | | | |
| | A | 0.584 | 0.556 | 0.407 |
| | B | 0.518 | 0.649 | 0.463 |
| | C | 0.466 | 0.743 | 0.508 |
| | D | 0.569 | 0.659 | 0.465 |
| | E | 0.570 | 0.606 | 0.443 |
| GPs 2 | | | | |
| | A | 0.590 | 0.548 | 0.415 |
| | B | 0.528 | 0.636 | 0.466 |
| | C | 0.450 | 0.752 | 0.530 |
| | D | 0.583 | 0.647 | 0.474 |
| | E | 0.577 | 0.599 | 0.442 |
1 OOB estimation for the training set. 2 Ten-fold cross-validation for the training set.
Figure 1. Performance of different ML algorithms by the five structural clusters for the test set.
Performance of the consensus models (CM1 and CM2) predicting pMIC against MRSA for the training and test sets.
| Model | R2 | RMSE | MAE | % Error ≥ 0.5/% Error < 0.5 1 |
|---|---|---|---|---|
| Training set | | | | |
| CM1 2 | 0.587 | 0.624 | 0.450 | 33/67 |
| CM2 3 | 0.601 | 0.617 | 0.453 | 34/66 |
| Test set | | | | |
| CM1 | 0.644 | 0.617 | 0.453 | 33/67 |
| CM2 | 0.683 | 0.593 | 0.444 | 35/65 |
1 Percent of molecules predicted with absolute error above or below 0.5. 2 Averaged predictions obtained by the RF, SVM and GPs models with the 150 most important CDK Ext FPs, using OOB estimation for RF and ten-fold cross-validation for the other ML techniques on the training set. 3 Averaged predictions obtained with the four best sets of descriptors (MACCS, SubC, PubChem and CDK Ext), using OOB estimation for the training set.
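Footnotes 2 and 3 describe both consensus models as averaged predictions. A minimal sketch of such unweighted averaging is given below; the function name and the dictionary layout are hypothetical, and equal weighting of the individual models is an assumption.

```python
def consensus_prediction(model_predictions):
    """Average per-molecule predictions across several models.

    `model_predictions` maps a model label (e.g. "RF", "SVM", "GPs") to a
    list of predicted pMIC values, one per molecule, in the same order.
    Equal weights are assumed for all models.
    """
    prediction_lists = list(model_predictions.values())
    if not prediction_lists:
        raise ValueError("at least one model is required")
    n_molecules = len(prediction_lists[0])
    if any(len(p) != n_molecules for p in prediction_lists):
        raise ValueError("all models must predict the same molecules")
    n_models = len(prediction_lists)
    # zip(*...) groups the predictions of all models molecule by molecule.
    return [sum(values) / n_models for values in zip(*prediction_lists)]
```

The averaged values can then be scored against the experimental pMIC values in the same way as the predictions of any single model.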
Figure 2. Predicted versus experimental pMIC against MRSA for the 1533 molecular structures of the test set.
Figure 3. Mapping of the trained and predicted structural clusters of the active and inactive molecules against MRSA on a self-organizing map (SOM) for the: (a) Training set; (b) Test set. Red: cluster A; dark blue: cluster B; green: cluster C; light yellow: cluster D; light blue: cluster E.
Figure 4. Mapping of the predicted structural clusters of the active and inactive molecules against MRSA on the SOM for the StreptomeDB 2.0 library. Red: cluster A; dark blue: cluster B; green: cluster C; light yellow: cluster D; light blue: cluster E.
The twelve virtual screening hits resulting from the StreptomeDB 2.0 library.
| ID 1 | Name | Type | pMIC 2 | Cluster 3 | ASD |
|---|---|---|---|---|---|
| 10301 | AGN-PC-07NF8H | Bis-pyrrole | 6.06 | A | 0.23 |
| 10232 | Marinopyrrole A | Bis-pyrrole | 5.92 | A | 0.23 |
| 5508 | Azalomycin | Spiro-tricyclic | 5.51 | A | 0.39 |
| 3643 | Methylsulfomycin I | Pyridine-containing 4 | 7.08 | E | 0.33 |
| 5495 | a10255 | Pyridine-containing 4 | 6.51 | E | 0.31 |
| 10186 | GE37468 | Pyridine-containing 4 | 6.41 | E | 0.35 |
| 3971 | Berninamycin C | Pyridine-containing 4 | 6.23 | E | 0.38 |
| 8767 | Cyclothiazomycin | Polythiazole-containing 5 | 6.02 | E | 0.34 |
| 4999 | Tallysomycin | Glycopeptide | 5.42 | E | 0.39 |
| 3183 | Cleomycin B2 | Glycopeptide | 5.36 | E | 0.37 |
| 9530 | Bleomycin z | Glycopeptide | 5.32 | E | 0.39 |
| 3322 | Bottromycin A2 | Macrocyclic peptide | 5.30 | E | 0.31 |
1 StreptomeDB ID number. 2 Predicted pMIC. 3 Estimated structural cluster on the SOM. 4 Thiopeptide. 5 Peptide.
Figure 5. Chemical structures of the twelve resulting virtual screening hits. 1 Biological activity reported in StreptomeDB 2.0.
Exploration of three collections of NMR descriptors for the QSAR RF model of antibacterial activity against MRSA classes for the training and test sets.
| Model | | # 1 | TP 2 | TN 3 | FP 4 | FN 5 | SE 6 | SP 7 | Q 8 | MCC 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Training set 10 | | | | | | | | | | |
| ¹³C | 0.5 | 400 | 12 | 72 | 8 | 24 | 0.33 | 0.90 | 0.72 | 0.29 |
| | 1 | 200 | 9 | 71 | 9 | 27 | 0.25 | 0.89 | 0.69 | 0.18 |
| | 1.5 | 133 | 12 | 72 | 8 | 24 | 0.33 | 0.90 | 0.72 | 0.29 |
| ¹H | 0.05 | 240 | 10 | 73 | 7 | 26 | 0.28 | 0.91 | 0.72 | 0.25 |
| | 0.1 | 120 | 12 | 72 | 8 | 24 | 0.33 | 0.90 | 0.72 | 0.29 |
| | 0.2 | 61 | 14 | 71 | 9 | 22 | 0.39 | 0.89 | 0.73 | 0.32 |
| | 0.5 | 23 | 8 | 69 | 11 | 28 | 0.22 | 0.86 | 0.66 | 0.11 |
| ¹³C | 0.5 | 461 | 13 | 75 | 5 | 23 | 0.36 | 0.94 | 0.76 | 0.38 |
| Test set | | | | | | | | | | |
| ¹³C | 0.5 | 400 | 4 | 24 | 2 | 9 | 0.31 | 0.92 | 0.72 | 0.30 |
| | 1 | 200 | 2 | 25 | 1 | 11 | 0.15 | 0.96 | 0.69 | 0.20 |
| | 1.5 | 133 | 1 | 22 | 4 | 12 | 0.08 | 0.85 | 0.59 | 0.11 |
| ¹H | 0.05 | 240 | 7 | 22 | 4 | 6 | 0.54 | 0.85 | 0.74 | 0.40 |
| | 0.1 | 120 | 7 | 21 | 5 | 6 | 0.54 | 0.81 | 0.72 | 0.36 |
| | 0.2 | 61 | 8 | 21 | 5 | 5 | 0.62 | 0.81 | 0.74 | 0.42 |
| | 0.5 | 23 | 7 | 22 | 4 | 6 | 0.54 | 0.85 | 0.74 | 0.40 |
| ¹³C | 0.5 | 461 | 7 | 24 | 2 | 6 | 0.54 | 0.92 | 0.79 | 0.52 |
1 Number of descriptors. 2 True positives. 3 True negatives. 4 False positives. 5 False negatives. 6 Sensitivity, the ratio of true positives to the sum of true positives and false negatives. 7 Specificity, the ratio of true negatives to the sum of true negatives and false positives. 8 Overall predictive accuracy, the ratio of the sum of true positives and true negatives to the sum of true positives, true negatives, false positives and false negatives. 9 Matthews correlation coefficient. 10 OOB estimation.
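The classification metrics defined in the footnote follow directly from the four confusion-matrix counts. The sketch below uses the footnote's definitions for SE, SP and Q and the standard formula for the Matthews correlation coefficient; the function name is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
import math


def classification_report(tp, tn, fp, fn):
    """Compute sensitivity (SE), specificity (SP), overall predictive
    accuracy (Q) and Matthews correlation coefficient (MCC)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    se = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    q = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is reported as 0.0 when it is undefined.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "Q": q, "MCC": mcc}


# Counts from the first training-set row of the table above.
row = classification_report(tp=12, tn=72, fp=8, fn=24)
print({name: round(value, 2) for name, value in row.items()})
```

With the counts TP = 12, TN = 72, FP = 8 and FN = 24, the rounded results agree with the tabulated SE of 0.33, SP of 0.90, Q of 0.72 and MCC of 0.29.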
Exploration of different ML algorithms in the prediction of two classes of antibacterial activity against MRSA using the 100 most important descriptors for the training and test sets.
| ML | SE 1 | SP 2 | Q 3 | MCC 4 |
|---|---|---|---|---|
| Training set | | | | |
| RF 5 | 0.56 | 0.91 | 0.80 | 0.51 |
| SVM 6 | 0.72 | 0.81 | 0.78 | 0.52 |
| CNN 6 | 0.61 | 0.89 | 0.80 | 0.52 |
| Test set | | | | |
| RF | 0.46 | 0.92 | 0.77 | 0.45 |
| SVM | 0.69 | 0.73 | 0.72 | 0.41 |
| CNN | 0.62 | 0.81 | 0.74 | 0.42 |
1 Sensitivity. 2 Specificity. 3 Overall predictive accuracy. 4 Matthews correlation coefficient. 5 OOB estimation. 6 Ten-fold cross-validation.
Prediction of activity classes against MRSA of the four pure compounds with the CM_NMR model.
| Code | Actinobacteria Genera | Structural Family | Activity Class 1 | MIC (μg/mL) 2 | Activity Class 2 |
|---|---|---|---|---|---|
| PTM-290 F7,F26 | | Diketopiperazine | InAct 3 | >250 | InAct 3 |
| PTM-290 F7,F27 | | Diketopiperazine | InAct 3 | >250 | InAct 3 |
| PTM-420 F4,F15 | | Unknown | InAct 3 | >250 | InAct 3 |
| PTM-420 F5,F45 | | Napyradiomycin | InAct 3 | 62.5 | MAct 4 |
1 Predicted. 2 Experimental. 3 Inactive. 4 Moderate-active-to-active.
Analysis of NMR descriptors for modeling anti-MRSA activity.
| H or C (# 1) | NMR Range (ppm) | Ranking 2 | Importance for InAct 3 | Importance for MAct 4 | Pattern Identification |
|---|---|---|---|---|---|
| H (19) | 11.2393–11.5676 | 1st | 9.23 | 6.59 | Hydrogen bond C |
| H (8) | 13.8656–14.1939 | 2nd | 6.22 | 4.86 | Hydrogen bond C |
| H (22) | 10.5828–10.9111 | 3rd | 3.92 | 4.86 | Hydrogen bond C |
| H (28) | 9.5979–9.9262 | 8th | 2.30 | 2.46 | Aldehyde C |
| C (318) | 127.4927–127.9927 | 9th | 3.08 | −0.02 | Aromatic; olefinic; nitrile |
| H (58) | 1.3909–1.7191 | 10th | 2.91 | 1.77 | Saturated alkane |
| H (48) | 7.3000–7.6282 | 11th | 2.17 | 1.05 | Aromatic; conjugated olefinic |
| C (410) | 175.9927–176.4927 | 14th | 1.85 | 0.08 | COX; X: O, N, Cl |
| C (321) | 33.9927–34.4927 | 15th | 2.61 | 0.58 | – |
| C (350) | 168.9927–169.4927 | 16th | 1.97 | 0.43 | COX; X: O, N, Cl,b unsat. COX; X: O, N, Cl |
| H (36) | 5.6585–5.9868 | 18th | 2.66 | 0.48 | Vinylic |
| C (141) | 52.4927–52.9927 | 19th | 2.24 | 1.00 | – |
| C (329) | 123.9927–124.4927 | 20th | 1.86 | 0.93 | Vinylic |
| C (415) | 171.9927–172.4927 | 21st | 2.14 | 3.40 | COX; X: O, N, Cl,b unsat. COX; X: O, N, Cl |
| C (401) | 178.4927–178.9927 | 23rd | 2.57 | 0.18 | COX; X: O, N, Cl,b unsat. COX; X: O, N, Cl |
1 Descriptor number. 2 Descriptor importance. 3 Inactive. 4 Moderate-active-to-active.