Sara Cruz, Sofia E Gomes, Pedro M Borralho, Cecília M P Rodrigues, Susana P Gaudêncio, Florbela Pereira.
Abstract
To discover new inhibitors against the human colon carcinoma HCT116 cell line, two quantitative structure–activity relationship (QSAR) studies using molecular and nuclear magnetic resonance (NMR) descriptors were developed by exploring machine learning techniques, with the half maximal inhibitory concentration (IC50) as the modeled activity. In the first approach, A, regression models were developed using a total of 7339 molecules extracted from the ChEMBL and ZINC databases and from recent literature. The performance of the regression models was evaluated by internal and external validation; the best model achieved R² of 0.75 and 0.73 and root mean square error (RMSE) of 0.66 and 0.69 for the training and test sets, respectively. Because working with natural products (NPs) is inherently time-consuming, we conceived a new NP drug hit discovery strategy, approach B, which consists of frontloading samples with 1D NMR descriptors to predict compounds with anticancer activity prior to bioactivity screening. The NMR QSAR classification models were built using 1D NMR data (¹H and ¹³C) as descriptors, from 50 crude extracts, 55 fractions and five pure compounds obtained from actinobacteria isolated from marine sediments collected off the Madeira Archipelago. The overall predictive accuracy of the best model exceeded 63% for both training and test sets.
Keywords: HCT116 cell line; NMR descriptors; anticancer activity; machine learning (ML); marine natural products (MNPs); marine-derived actinobacteria; molecular descriptors; quantitative structure–activity relationship (QSAR)
Year: 2018 PMID: 30018273 PMCID: PMC6164384 DOI: 10.3390/biom8030056
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Figure 1. Flowchart representing the HCT116 quantitative structure–activity relationship (QSAR) model building process (I, left), illustrated with results obtained here as well as future applications (II, right).
Structural Clusters and pIC50 Values for HCT116 within the Clusters.
| Clusters 1 | Training Set 2 | Test Set 2 | Average/Maximum pIC50 3 |
|---|---|---|---|
| I—ChEMBL1078221 | 650 | 129 | 5.24/11.00 |
| II—ChEMBL1078389 | 569 | 135 | 5.43/9.52 |
| III—ChEMBL148968 | 438 | 105 | 5.42/9.60 |
| IV—ChEMBL116081 | 713 | 200 | 5.83/11.51 |
| V—ChEMBL1083086 | 885 | 208 | 5.32/9.26 |
| VI—ChEMBL104408 | 405 | 119 | 5.80/9.31 |
| VII—ChEMBL1090871 | 661 | 159 | 5.39/12.00 |
| VIII—ChEMBL116614 | 626 | 152 | 5.76/9.24 |
| IX—ChEMBL1078573 | 513 | 125 | 5.74/10.35 |
| X—ChEMBL1830679 | 415 | 132 | 5.77/9.05 |
1 Cluster number and chemical structure of the cluster centroid; 2 Number of molecules; 3 Within the cluster for the training set.
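The tables report activity as pIC50 rather than IC50. By the usual convention, pIC50 is the negative base-10 logarithm of IC50 expressed in mol/L. The sketch below illustrates that conversion; the function names and the unit table are illustrative assumptions, not taken from the paper.

```python
import math

# Conversion factors from common concentration units to mol/L (assumed helper table).
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}

def pic50_from_ic50(ic50, unit="uM"):
    """Return pIC50 = -log10(IC50 in mol/L)."""
    if ic50 <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50 * UNIT_TO_MOLAR[unit])

def ic50_um_from_pic50(pic50):
    """Return the IC50 in micromolar that corresponds to a pIC50 value."""
    return 10 ** (6.0 - pic50)

if __name__ == "__main__":
    print(round(pic50_from_ic50(1.0, "uM"), 2))   # 1 uM expressed as pIC50
    print(round(ic50_um_from_pic50(5.24), 2))     # average pIC50 of cluster I, in uM
```

On this scale, each unit of pIC50 corresponds to a tenfold change in potency, so an absolute prediction error of 1 means the predicted IC50 is off by a factor of ten.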
Actinomycetes genera and corresponding IC50 values for HCT116.
| Actinomycetes Genera | Set (Number/Sample Types) | Activity Class/Average IC50 1 |
|---|---|---|
| | Tr 2 set (2, crude extracts) | inactive/≥156 |
| | Tr 2 set (4, 1 crude extract and 3 fractions) | active/33.95 |
| | Tr 2 set (11, 3 crude extracts and 8 fractions) | inactive/≥156 |
| | Tr 2 set (1, 1 fraction) | active/9.8 |
| | Tr 2 set (20, 9 crude extracts and 11 fractions) | inactive/≥156 |
| | Tr 2 set (20, 11 crude extracts and 9 fractions) | active/16.26 |
| | Tr 2 set (16, 9 crude extracts and 7 fractions) | inactive/≥15 |
| | Te 3 set (1, crude extract) | inactive/≥156 |
| | Te 3 set (1, crude extract) | inactive/≥156 |
| | Te 3 set (1, crude extract) | active/7.9 |
| | Te 3 set (11, 1 crude extract, 5 fractions, and 5 pure compounds) | inactive/≥156 |
| | Te 3 set (1, crude fraction) | active/4.94 |
| | Te 3 set (7, 5 crude extracts and 2 fractions) | inactive/≥156 |
| | Te 3 set (7, 2 crude extracts and 5 fractions) | active/26.31 |
| | Te 3 set (7, 3 crude extracts and 4 fractions) | inactive/≥156 |
1 μg/mL; 2 Training set; 3 Test set.
Exploration of two collections of empirical descriptors for the quantitative structure-activity relationship k-nearest neighbors (QSAR k-NN) model of pIC50 for the training set with a ten-fold cross-validation. The best models are highlighted in bold.
| Descriptors (#) | CFS Search Type | No. of Selected Descriptors | R² | RMSE | MAE | % error ≥ 1/% error < 1 1 |
|---|---|---|---|---|---|---|
| E-State (79) 2 | GSW 4 | 13 | 0.174 | 1.208 | 0.927 | 38/62 |
| MACCS (166) 2 | PSOs 5 | 34 | 0.512 | 0.937 | 0.665 | 22/78 |
| Sub (307) 2 | PSOs 5 | 63 | 0.372 | 1.055 | 0.797 | 30/70 |
| SubC (307) 2 | BF 6 | 63 | 0.509 | 0.942 | 0.671 | 23/77 |
| AP2D (780) 2 | PSOs 5 | 120 | 0.442 | 1.007 | 0.702 | 23/77 |
| APC2D (780) 2 | PSOs 5 | 174 | 0.589 | 0.866 | 0.589 | 18/82 |
| PubChem (881) 2 | PSOs 5 | 252 | | | | |
| CDK (1024) 2 | PSOs 5 | 283 | | | | |
| CDK Ext (1024) 2 | PSOs 5 | 257 | | | | |
| CDK graph (1024) 2 | PSOs 5 | 179 | 0.644 | 0.807 | 0.546 | 16/84 |
| KR (4860) 2 | PSOs 5 | 192 | 0.604 | 0.847 | 0.591 | 19/81 |
| KRC (4860) 2 | PSOs 5 | 160 | 0.618 | 0.832 | 0.579 | 18/82 |
| 1D2D (1438) 3 | PSOs 5 | 416 | | | | |
| 1D2D3D (1869) 3 | PSOs 5 | 489 | | | | |
1 Percent of molecules predicted with absolute error above or below 1; 2 Fingerprints; 3 Molecular descriptors; 4 GreedyStepwise option for search; 5 PSOsearch option for search; 6 BestFirst option for search. Abbreviations: RMSE, root mean square error; MAE, mean absolute error.
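The columns of this table (R², RMSE, MAE and the share of molecules with absolute error of at least 1) can be computed from paired experimental and predicted pIC50 values. The following is a minimal sketch on made-up toy values. It assumes R² is the squared Pearson correlation between experimental and predicted values, which the source does not state; the function name is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, MAE and the percentage of samples with |error| >= 1."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # Assumption: R2 taken as the squared Pearson correlation coefficient.
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r2 = cov * cov / (var_t * var_p)
    pct_large = 100.0 * sum(1 for e in errors if abs(e) >= 1.0) / n
    return {"R2": r2, "RMSE": rmse, "MAE": mae, "pct_error_ge_1": pct_large}

if __name__ == "__main__":
    # Toy pIC50 values, invented for illustration only.
    experimental = [5.0, 6.0, 7.0, 8.0]
    predicted = [5.5, 6.0, 8.0, 7.5]
    for name, value in regression_metrics(experimental, predicted).items():
        print(name, round(value, 3))
```

Reporting the error split alongside RMSE shows how many predictions miss by a full log unit or more, which an average error alone can hide.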
Performance of different machine learning algorithms. The best models are highlighted in bold.
| Models | ML | |||
|---|---|---|---|---|
| RF 1 | SVM 2 | |||
| 1D2D 3 | R² | 0.730 | 0.647 | 0.703 |
| | RMSE | 0.708 | 0.800 | 0.737 |
| | MAE | 0.523 | 0.566 | 0.493 |
| | % error ≥ 1/% error < 1 7 | 13/87 | 16/84 | 13/87 |
| 1D2D3D 4 | R² | 0.729 | 0.615 | 0.705 |
| | RMSE | 0.713 | 0.842 | 0.733 |
| | MAE | 0.525 | 0.572 | 0.493 |
| | % error ≥ 1/% error < 1 7 | 13/87 | 17/83 | 13/87 |
| PubChem 5 | R² | | 0.677 | 0.696 |
| | RMSE | | 0.762 | 0.742 |
| | MAE | | 0.535 | 0.500 |
| | % error ≥ 1/% error < 1 7 | | 15/85 | 14/86 |
| CDK 6 | R² | 0.753 | | |
| | RMSE | 0.665 | | |
| | MAE | 0.471 | | |
| | % error ≥ 1/% error < 1 7 | 11/89 | | |
1 Out-of-bag (OOB) estimation for the training set; 2 Ten-fold cross-validation for the training set; 3 1438 and 416 descriptors for random forest (RF) and support vector machines/k-nearest neighbors (SVM/k-NN), respectively; 4 1869 and 489 descriptors for RF and SVM/k-NN, respectively; 5 881 and 252 descriptors for RF and SVM/k-NN, respectively; 6 1024 and 257 descriptors for RF and SVM/k-NN, respectively; 7 Percent of molecules predicted with absolute error above or below 1.
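Footnotes 1 and 2 name two internal validation schemes: out-of-bag (OOB) estimation for RF and ten-fold cross-validation for the other learners. The sketch below shows only how the samples are partitioned in each scheme, not the authors' software or settings. The function names, the seeds and the round-robin fold assignment are illustrative assumptions.

```python
import random

def ten_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, held_out_indices) so each sample is held out once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        yield train, held_out

def bootstrap_oob(n_samples, seed=0):
    """Draw one bootstrap sample (as for a single RF tree) and its out-of-bag set."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag

if __name__ == "__main__":
    for train, held_out in ten_fold_splits(25):
        print(len(train), len(held_out))
    in_bag, oob = bootstrap_oob(25, seed=1)
    print(len(set(in_bag)), "distinct in-bag samples;", len(oob), "out-of-bag samples")
```

In both schemes every prediction used for scoring comes from a model that did not see that sample during fitting, which is why the two estimates can be set side by side in the same table.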
Figure 2. Analysis of descriptor selection using the RF algorithm in an OOB estimation for the training set.
The predictions of the best HCT116 QSAR model by the ten structural clusters for training and test sets. The best models are highlighted in bold.
| Cluster | Training Set: # | Training Set: R² | Training Set: RMSE | Training Set: MAE | Test Set: # | Test Set: R² | Test Set: RMSE | Test Set: MAE |
|---|---|---|---|---|---|---|---|---|
| I | 650 | 0.702 | | 0.479 | 129 | 0.613 | 0.673 | 0.455 |
| II | 569 | 0.766 | 0.627 | 0.446 | 135 | 0.781 | 0.619 | 0.461 |
| III | 438 | 0.792 | 0.648 | 0.459 | 105 | 0.697 | 0.737 | 0.516 |
| IV | 713 | 0.759 | | 0.459 | 200 | 0.682 | 0.821 | 0.533 |
| V | 885 | 0.685 | 0.658 | 0.481 | 208 | 0.658 | 0.734 | 0.489 |
| VI | 405 | 0.649 | 0.646 | 0.460 | 119 | 0.637 | 0.616 | 0.462 |
| VII | 661 | 0.790 | 0.652 | 0.445 | 159 | 0.776 | 0.625 | 0.430 |
| VIII | 626 | 0.636 | | 0.512 | 152 | 0.706 | 0.585 | 0.432 |
| IX | 513 | 0.846 | 0.599 | 0.412 | 125 | 0.794 | 0.720 | 0.487 |
| X | 415 | 0.746 | 0.628 | 0.440 | 132 | 0.767 | 0.659 | 0.448 |
Figure 3. Predicted vs. experimental pIC50 against HCT116 for the 129, 200 and 1135 molecular structures of cluster I, cluster IV and the remaining clusters of the test set, respectively.
Analysis of the importance of the descriptors used to build the best QSAR model for the prediction of pIC50 against HCT116.
| Code | DI 1 | Chemical Pattern |
|---|---|---|
| HEC_2 | 17th | ≥16C |
| HEC_19 | 16th | ≥2O |
| ESSSR_157 | 10th | ≥3 any ring size 5 |
| ESSSR_261 | 5th | ≥4 aromatic rings |
| SAP_301 | 18th | N-O |
| SAP_305 | 19th | N-S |
| SANN_335 | 7th | |
| SANN_338 | 4th | |
| SANN_339 | 11th | |
| SANN_346 | 15th | |
| DANh_432 | 8th | |
| SSP_514 | 12th | |
| SSP_518 | 6th | |
| SSP_615 | 13th | |
| SSP_631 | 20th | |
| SSP_643 | 3rd | |
| SSP_672 | 14th | |
| CSP_713 | 1st | |
| CSP_755 | 9th | |
| CSP_819 | 2nd | |
1 Descriptor importance (rank).
Figure 4. Mapping of the trained and predicted structural clusters of the active and inactive molecules against HCT116 on a SOM for the (a) training set and (b) test set. Red: Cluster I; Dark blue: Cluster II; Green: Cluster III; Light yellow: Cluster IV; Light blue: Cluster V; Pink: Cluster VI; Dark yellow: Cluster VII; Purple: Cluster VIII; Dark grey: Cluster IX; Light grey: Cluster X.
Exploration of three collections of NMR descriptors for the QSAR RF model of HCT116 activity classes for the training and test sets. The best models are highlighted in bold.
| Model | # 2 | Training 1/Test Sets | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | TP 3 | TN 4 | FP 5 | FN 6 | SE 7 | SP 8 | Q 9 | G-Mean 10 |
| | | | | | | | | | |
| 0.5 | 400 | 12/3 | 38/20 | 11/7 | 13/6 | 0.48/0.33 | 0.78/0.74 | 0.68/0.64 | 0.61/0.50 |
| 1 | 200 | 13/3 | 35/16 | 8/7 | 12/6 | 0.52/0.33 | 0.81/0.70 | 0.71/0.59 | 0.65/0.48 |
| 1.5 | 133 | 12/2 | 42/20 | 7/7 | 13/7 | 0.48/0.22 | 0.86/0.74 | 0.73/0.61 | 0.64/0.41 |
| | | | | | | | | | |
| 0.05 | 240 | 13/2 | 41/20 | 8/7 | 12/7 | 0.52/0.22 | 0.84/0.74 | 0.73/0.61 | 0.66/0.41 |
| 0.1 | 120 | 15/2 | 40/19 | 9/8 | 10/7 | 0.60/0.22 | 0.82/0.70 | 0.74/0.58 | 0.70/0.40 |
| 0.2 | 61 | 14/2 | 39/18 | 10/9 | 11/7 | 0.56/0.22 | 0.80/0.67 | 0.72/0.56 | 0.67/0.39 |
| | | | | | | | | | |
| 0.05; 0.5 | 640 | 13/4 | 44/21 | 5/6 | 12/5 | 0.52/0.44 | 0.90/0.78 | 0.77/0.69 | 0.68/0.59 |
| 0.1; 0.5 | 520 | 14/5 | 44/19 | 5/8 | 11/4 | 0.56/0.56 | 0.90/0.70 | 0.78/0.67 | |
| 0.1; 1 | 320 | 13/3 | 44/19 | 5/8 | 12/6 | 0.52/0.33 | 0.90/0.70 | 0.77/0.61 | 0.68/0.48 |
1 OOB estimation; 2 Number of descriptors; 3 True positives; 4 True negatives; 5 False positives; 6 False negatives; 7 Sensitivity, the ratio of true positives to the sum of true positives and false negatives; 8 Specificity, the ratio of true negatives to the sum of true negatives and false positives; 9 Overall predictive accuracy, the ratio of the sum of true positives and true negatives to the sum of true positives, true negatives, false positives and false negatives; 10 The square root of the product of sensitivity and specificity.
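The footnote definitions of SE, SP, Q and G-mean translate directly into code. As a sketch, the function below (its name is an illustrative choice) applies them to a set of confusion-matrix counts. The example uses the test-set counts reported for the class-balanced model in the next table (TP 6, TN 17, FP 10, FN 3).

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, overall accuracy and G-mean from counts."""
    se = tp / (tp + fn)                   # sensitivity: TP / (TP + FN)
    sp = tn / (tn + fp)                   # specificity: TN / (TN + FP)
    q = (tp + tn) / (tp + tn + fp + fn)   # overall predictive accuracy
    g_mean = math.sqrt(se * sp)           # square root of SE x SP
    return {"SE": se, "SP": sp, "Q": q, "G-Mean": g_mean}

if __name__ == "__main__":
    metrics = classification_metrics(tp=6, tn=17, fp=10, fn=3)
    for name, value in metrics.items():
        print(name, round(value, 2))
```

Because the G-mean multiplies sensitivity by specificity, a model that favors the majority inactive class scores poorly on it even when Q looks acceptable. Several rows above show this pattern, with test-set SE of 0.22 alongside Q near 0.6.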
Balancing of the moderate-active-to-active and inactive classes for the best NMR RF model of HCT116 activity classes for the training and test sets.
| Sets | TP 1 | TN 2 | FP 3 | FN 4 | SE 5 | SP 6 | Q 7 | G-Mean 8 |
|---|---|---|---|---|---|---|---|---|
| Training | 18 | 36 | 13 | 7 | 0.72 | 0.74 | 0.73 | 0.73 |
| Test | 6 | 17 | 10 | 3 | 0.67 | 0.63 | 0.64 | 0.65 |
1 True positives; 2 True negatives; 3 False positives; 4 False negatives; 5 Sensitivity, the ratio of true positives to the sum of true positives and false negatives; 6 Specificity, the ratio of true negatives to the sum of true negatives and false positives; 7 Overall predictive accuracy, the ratio of the sum of true positives and true negatives to the sum of true positives, true negatives, false positives and false negatives; 8 The square root of the product of sensitivity and specificity.
Prediction of activity classes against HCT116 of the five pure compounds with the best model.
| Code | Activity Class | Probability of Being Moderate-Active-to-Active |
|---|---|---|
| PTM-99_F2_F27 | Inactive | 0.26 |
| PTM-99_F2_F31 | Inactive | 0.42 |
| PTM-420_F4_F15 | Moderate-active-to-active | 0.64 |
| PTM-420_F5_F42 | Moderate-active-to-active | 0.53 |
| PTM-420_F5_F43 | Moderate-active-to-active | 0.55 |
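The class assignments in this table are consistent with a cutoff of 0.5 on the predicted probability of the moderate-active-to-active class. The source does not state the cutoff, so the 0.5 threshold in this sketch is an assumption, and the function name is hypothetical.

```python
def assign_class(p_active, threshold=0.5):
    """Map a class probability to a label; the 0.5 threshold is an assumption."""
    return "Moderate-active-to-active" if p_active >= threshold else "Inactive"

# Probabilities as reported in the table for the five pure compounds.
PROBABILITIES = {
    "PTM-99_F2_F27": 0.26,
    "PTM-99_F2_F31": 0.42,
    "PTM-420_F4_F15": 0.64,
    "PTM-420_F5_F42": 0.53,
    "PTM-420_F5_F43": 0.55,
}

if __name__ == "__main__":
    for code, p in PROBABILITIES.items():
        print(code, assign_class(p))
```

Two of the three positive calls (0.53 and 0.55) sit close to this cutoff, so a small shift in the threshold would change their assigned class.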
Analysis of NMR Descriptors for modeling HCT116 activity in the best RF model.
| H or C (# 1) | NMR Range (ppm) | DI 2 | Importance for MAct-Act 3 Class | Importance for InAct 4 Class | Pattern Identification |
|---|---|---|---|---|---|
| H (14) | 1.3019–1.4019 | 1st | 5.43 | 5.97 | Saturated |
| H (44) | 4.3019–4.4019 | 2nd | 5.90 | 4.46 | Z = O, N, X 5 |
| H (2) | 0.1019–0.2019 | 3rd | 6.43 | 4.01 | Saturated |
| H (3) | 0.2019–0.3019 | 4th | 4.79 | 4.20 | Saturated |
| H (4) | 0.3019–0.4019 | 5th | 3.94 | 4.60 | Saturated |
| H (45) | 4.4019–4.5019 | 6th | 4.49 | 4.13 | Z = O, N, X 5 |
| H (5) | 0.4019–0.5019 | 7th | 3.27 | 4.04 | Saturated |
| C (271) | 74.9927–75.4927 | 8th | 2.00 | 2.98 | Alcohol and ethers |
| H (6) | 0.5019–0.6019 | 9th | 1.77 | 3.25 | Saturated |
| H (52) | 5.1019–5.2019 | 10th | 1.87 | 2.67 | Vinylic |
| H (32) | 3.1019–3.2019 | 12th | 0.881 | 2.87 | Z = O, N, X 5 |
| H (51) | 5.0019–5.1019 | 15th | 0.0887 | 2.73 | Vinylic |
| C (170) | 24.4927–24.9927 | 20th | 0.712 | 2.14 | Allylic |
| C (352) | 115.4927–115.9927 | 21st | 2.12 | 0.833 | Aromatic |
| C (280) | 79.4927–79.9927 | 26th | 0.0743 | 1.88 | Alcohol and ethers |
| H (73) | 7.2019–7.3019 | 32nd | 0.083 | 1.91 | Aromatic |
| H (13) | 1.2019–1.3019 | 49th | 1.42 | 0.0695 | Saturated |
1 Descriptor number; 2 Descriptor importance; 3 Moderate-active-to-active class; 4 Inactive class; 5 Halogen.
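The ranges in this table imply fixed-width chemical-shift bins: 0.1 ppm for ¹H and 0.5 ppm for ¹³C, with the ¹³C descriptors numbered after the ¹H ones. The sketch below maps a chemical shift to its descriptor number and back. The bin origins (0.0019 ppm for ¹H, −0.0073 ppm for ¹³C) and the count of 120 ¹H descriptors are inferred from the tabulated ranges and numbers, not stated in the source, and the function names are hypothetical.

```python
# Values inferred from the table above (assumptions, not stated by the authors).
H_START, H_WIDTH, N_H = 0.0019, 0.1, 120   # 1H bins numbered 1..120
C_START, C_WIDTH = -0.0073, 0.5            # 13C bins numbered from 121

def descriptor_number(nucleus, shift_ppm):
    """Return the descriptor number of the bin containing a chemical shift."""
    if nucleus == "H":
        return int((shift_ppm - H_START) // H_WIDTH) + 1
    if nucleus == "C":
        return int((shift_ppm - C_START) // C_WIDTH) + 1 + N_H
    raise ValueError("nucleus must be 'H' or 'C'")

def descriptor_range(number):
    """Return (nucleus, lower ppm, upper ppm) for a descriptor number."""
    if number <= N_H:
        low = H_START + (number - 1) * H_WIDTH
        return "H", round(low, 4), round(low + H_WIDTH, 4)
    low = C_START + (number - N_H - 1) * C_WIDTH
    return "C", round(low, 4), round(low + C_WIDTH, 4)

if __name__ == "__main__":
    print(descriptor_number("H", 1.35))   # falls in the bin listed as H (14)
    print(descriptor_number("C", 75.2))   # falls in the bin listed as C (271)
    print(descriptor_range(352))          # bin listed as C (352)
```

Under these inferred settings the two spectra contribute 120 and 400 bins, which matches the 520 descriptors listed for the "0.1; 0.5" model.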