Raghad Al-Jarf, Alex G C de Sá, Douglas E V Pires, David B Ascher.
Abstract
The development of new, effective, and safe drugs to treat cancer remains a challenging and time-consuming task due to limited hit rates, which restrain subsequent development efforts. Despite the impressive progress of quantitative structure-activity relationship and machine learning-based models developed to predict molecule pharmacodynamics and bioactivity, these models have had mixed success at identifying compounds with anticancer properties against multiple cell lines. Here, we have developed a novel predictive tool, pdCSM-cancer, which uses a graph-based signature representation of the chemical structure of a small molecule to accurately predict molecules likely to be active against one or multiple cancer cell lines. pdCSM-cancer represents the most comprehensive anticancer bioactivity prediction platform developed to date, comprising models trained and validated on experimental growth inhibition concentration (GI50%) data for over 18,000 compounds across 9 tumor types and 74 distinct cancer cell lines. Across 10-fold cross-validation, it achieved Pearson's correlation coefficients of up to 0.74 and comparable performance of up to 0.67 across independent, non-redundant blind tests. Leveraging the insights from these cell line-specific models, we developed a generic predictive model to identify molecules active in at least 60 cell lines. Our final model achieved an area under the receiver operating characteristic curve (AUC) of up to 0.94 on 10-fold cross-validation and up to 0.94 on independent non-redundant blind tests, outperforming alternative approaches. We believe that our predictive tool will provide a valuable resource for optimizing and enriching screening libraries for the identification of effective and safe anticancer molecules.
To provide a simple and integrated platform to rapidly screen for potential biologically active molecules with favorable anticancer properties, we made pdCSM-cancer freely available online at http://biosig.unimelb.edu.au/pdcsm_cancer.
Year: 2021 PMID: 34213323 PMCID: PMC8317153 DOI: 10.1021/acs.jcim.1c00168
Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956
Figure 1. pdCSM-cancer workflow. The developed approach has four major stages. (1) In data curation, small-molecule activity data (in terms of GI50%) were obtained from DTP of NCI[23] for nine different tumor types (74 cancer cell lines); (2) in feature engineering, two types of features were calculated: (i) graph-based signatures, which represent the chemical geometry and physicochemical properties of small molecules, and (ii) compound general properties and pharmacophores; (3) these features were then employed to train and test predictive models using supervised learning, and model optimization was carried out via greedy feature selection; (4) finally, the models with the best performance were made available through a user-friendly web interface.
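The greedy feature selection mentioned in stage (3) can be illustrated with a minimal sketch. The caption does not say whether the selection was forward or backward, nor which score was optimized, so the forward variant, the function names, the feature names, and the toy scoring function below are all assumptions for illustration; in practice the score would typically be a cross-validated performance measure.

```python
from typing import Callable, List, Sequence


def greedy_forward_selection(
    candidates: Sequence[str],
    score_fn: Callable[[List[str]], float],
) -> List[str]:
    """Add one feature at a time, keeping the addition that most improves the score."""
    selected: List[str] = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score every one-feature extension of the current subset.
        trial_scores = [(score_fn(selected + [f]), f) for f in remaining]
        trial_best, trial_feature = max(trial_scores)
        if trial_best <= best_score:
            break  # no remaining feature improves the score, so stop
        selected.append(trial_feature)
        remaining.remove(trial_feature)
        best_score = trial_best
    return selected


# Hypothetical toy scorer: each feature adds a fixed amount to the score.
# A real scorer would retrain and cross-validate a model on the subset.
TOY_GAIN = {"graph_signature": 0.30, "logP": 0.20, "noise": -0.10}


def toy_score(subset: List[str]) -> float:
    return sum(TOY_GAIN[f] for f in subset)


chosen = greedy_forward_selection(list(TOY_GAIN), toy_score)
print(chosen)
```

With this toy scorer, the feature with a negative contribution is never added because it would lower the score of the current subset.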
Figure 2. Molecular substructure enrichment in compounds with anticancer activity. (A) The first fragment occurred in 6.50% of active and 1.78% of inactive molecules tested against the CNS, ovarian, and non-small-cell lung cancer cell panels (lines). (B) The second fragment occurred in 5.95% of active and 1.86% of inactive molecules tested against leukemia, renal, and breast cancer cell panels. (C) The third fragment was identified in 6.58% of active and 1.70% of inactive molecules tested against prostate (not shown), melanoma, and colon cancer cell lines. (D) The fourth fragment was found in 11% of active and 2.0% of inactive compounds tested against small cell lung cancer cell lines.
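One way to read these frequencies is as a fold enrichment, the ratio of a fragment's frequency in active molecules to its frequency in inactive ones. The short sketch below computes that ratio from the percentages quoted in the caption; the ratio itself is not reported in the source, and the variable and function names are illustrative.

```python
# Fragment frequencies (percent of molecules) as quoted in the Figure 2 caption:
# panel -> (percent of actives, percent of inactives).
fragment_freq = {
    "A": (6.50, 1.78),
    "B": (5.95, 1.86),
    "C": (6.58, 1.70),
    "D": (11.0, 2.0),
}


def fold_enrichment(active_pct: float, inactive_pct: float) -> float:
    """Ratio of the fragment's frequency in actives to that in inactives."""
    return active_pct / inactive_pct


enrichment = {
    panel: fold_enrichment(active, inactive)
    for panel, (active, inactive) in fragment_freq.items()
}

for panel, ratio in enrichment.items():
    print(f"fragment {panel}: {ratio:.2f}x more frequent in actives")
```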
Figure 3. Receiver operating characteristic (ROC) curves of the pdCSM-cancer classification model. Our predictive model accurately identified active molecules with AUC > 0.84 on cross-validation and blind tests. (A) Performance of pdCSM-cancer on the CDRUG data. (B) Performance of pdCSM-cancer predictor on the updated NCI-60 data.
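For readers unfamiliar with the metric, the AUC equals the probability that a randomly chosen active molecule receives a higher score than a randomly chosen inactive one. The following is a minimal sketch of that pairwise definition on made-up labels and scores; it is not the authors' evaluation code, and the toy data are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive molecule")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = active, 0 = inactive.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.3, 0.1]
auc = roc_auc(toy_labels, toy_scores)
print(auc)
```

The pairwise loop is quadratic in the number of molecules, which is fine for illustration; rank-based implementations are preferable for large screening sets.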
Comparative Performance on Cross-Validation between the pdCSM-cancer Classification Model and Another Available Approach, CDRUG
| method | MCC | sensitivity | FPR | accuracy | AUC |
|---|---|---|---|---|---|
| pdCSM-cancer (updated NCI-60 data) | 0.72 | 0.84 | 0.13 | 0.86 | 0.94 |
| pdCSM-cancer (CDRUG data) | 0.70 | 0.85 | 0.15 | 0.86 | 0.90 |
| CDRUG | * | 0.81 | 0.20 | * | 0.87 |
Asterisk: MCC and accuracy scores were not reported by CDRUG, nor was the prediction matrix (i.e., predicted and actual values for each molecule).
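The columns of this table all derive from a binary confusion matrix, which is why MCC and accuracy cannot be recovered for CDRUG without its prediction matrix. The sketch below shows the standard definitions of the four count-based metrics; the counts passed in are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute MCC, sensitivity, FPR, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # true positive rate
    fpr = fp / (fp + tn)                    # false positive rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"MCC": mcc, "sensitivity": sensitivity, "FPR": fpr, "accuracy": accuracy}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```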
Performance of the Final pdCSM-Cancer Regression Models on Cross-Validation and Blind Test Sets
| tissue | cell lines | Pearson (CV) | Pearson (blind test) |
|---|---|---|---|
| CNS | SF-268 | 0.66 | 0.59 |
| | SF-295 | 0.68 | 0.59 |
| | SF-539 | 0.66 | 0.59 |
| | SNB-19 | 0.66 | 0.61 |
| | SNB-75 | 0.63 | 0.59 |
| | SNB-78 | 0.61 | 0.52 |
| | U251 | 0.69 | 0.63 |
| | XF-498 | 0.60 | 0.51 |
| breast | BT-549 | 0.65 | 0.56 |
| | HS-578 T | 0.64 | 0.54 |
| | MCF7 | 0.69 | 0.59 |
| | MDA-MB-231_ATCC | 0.68 | 0.58 |
| | MDA-MB-468 | 0.58 | 0.49 |
| | T-47D | 0.64 | 0.56 |
| colon | COLO-205 | 0.67 | 0.58 |
| | DLD-1 | 0.66 | 0.59 |
| | HCC-2998 | 0.63 | 0.63 |
| | HCT-116 | 0.69 | 0.61 |
| | HCT-15 | 0.63 | 0.56 |
| | HT29 | 0.67 | 0.65 |
| | KM12 | 0.66 | 0.59 |
| | KM20L2 | 0.62 | 0.53 |
| | SW-620 | 0.69 | 0.59 |
| leukemia | CCRF-CEM | 0.65 | 0.59 |
| | HL-60 TB | 0.63 | 0.56 |
| | K-562 | 0.66 | 0.57 |
| | MOLT-4 | 0.67 | 0.59 |
| | P388_ADR | 0.69 | 0.67 |
| | P388 | 0.74 | 0.65 |
| | RPMI-8226 | 0.63 | 0.59 |
| | SR | 0.63 | 0.62 |
| ovarian | IGROV1 | 0.66 | 0.57 |
| | NCI_ADR-RES | 0.65 | 0.56 |
| | OVCAR-3 | 0.67 | 0.58 |
| | OVCAR-4 | 0.63 | 0.62 |
| | OVCAR-5 | 0.64 | 0.54 |
| | OVCAR-8 | 0.68 | 0.59 |
| | SK-OV-3 | 0.65 | 0.57 |
| prostate | DU-145 | 0.67 | 0.58 |
| | PC-3 | 0.69 | 0.59 |
| renal | 786-0 | 0.65 | 0.56 |
| | A498 | 0.64 | 0.63 |
| | ACHN | 0.68 | 0.59 |
| | CAKI-1 | 0.65 | 0.56 |
| | RXF-393 | 0.65 | 0.56 |
| | RXF-631 | 0.66 | 0.63 |
| | SN12C | 0.65 | 0.59 |
| | SN12K1 | 0.74 | 0.65 |
| | TK-10 | 0.64 | 0.61 |
| | UO-31 | 0.65 | 0.55 |
| non-small cell lung | A549_ATCC | 0.68 | 0.58 |
| | EKVX | 0.62 | 0.53 |
| | HOP-18 | 0.54 | 0.48 |
| | HOP-62 | 0.65 | 0.59 |
| | HOP-92 | 0.60 | 0.57 |
| | LXFL-529 | 0.67 | 0.59 |
| | NCI-H226 | 0.64 | 0.54 |
| | NCI-H23 | 0.69 | 0.59 |
| | NCI-H322M | 0.64 | 0.67 |
| | NCI-H460 | 0.69 | 0.63 |
| | NCI-H522 | 0.67 | 0.57 |
| small cell lung | DMS-114 | 0.67 | 0.58 |
| | DMS-273 | 0.58 | 0.48 |
| melanoma | LOX-IMVI | 0.67 | 0.59 |
| | M14 | 0.66 | 0.58 |
| | M19-MEL | 0.66 | 0.59 |
| | MALME-3 M | 0.64 | 0.59 |
| | MDA-MB-435 | 0.68 | 0.59 |
| | MDA-N | 0.66 | 0.61 |
| | SK-MEL-28 | 0.66 | 0.59 |
| | SK-MEL-2 | 0.65 | 0.56 |
| | SK-MEL-5 | 0.66 | 0.64 |
| | UACC-257 | 0.65 | 0.59 |
| | UACC-62 | 0.67 | 0.59 |
Figure 4. pdCSM-cancer regression performance on cross-validation. Scatter plots of experimental versus predicted GI50% values, given in −log10(molar), are displayed for each of the nine cancer cell line panel models. Pearson's correlation coefficient (r) is shown for each scatter plot (in black for 90% of the data, after the 10% of outliers, depicted in red, were removed).
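The "90% of the data" correlation described in this caption can be sketched as follows. The caption does not state how outliers were identified, so this example assumes they are the points with the largest absolute difference between experimental and predicted values; that criterion, the helper names, and the toy data are all assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def to_neg_log10_molar(gi50_molar: float) -> float:
    """Convert a GI50 concentration in mol/L to the -log10(molar) scale."""
    return -math.log10(gi50_molar)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def trim_outliers(
    experimental: Sequence[float],
    predicted: Sequence[float],
    fraction: float = 0.10,
) -> Tuple[List[float], List[float]]:
    """Drop the given fraction of points with the largest absolute error.

    Assumption: outliers are ranked by |experimental - predicted|.
    """
    pairs = sorted(zip(experimental, predicted), key=lambda p: abs(p[0] - p[1]))
    n_keep = len(pairs) - int(round(fraction * len(pairs)))
    kept = pairs[:n_keep]
    return [e for e, _ in kept], [p for _, p in kept]


# Hypothetical toy values on the -log10(molar) scale; the last point is an outlier.
experimental = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5]
predicted = [4.1, 4.4, 5.2, 5.4, 6.1, 6.4, 7.1, 7.6, 7.9, 5.0]

r_all = pearson_r(experimental, predicted)
exp_kept, pred_kept = trim_outliers(experimental, predicted)
r_trimmed = pearson_r(exp_kept, pred_kept)
print(f"r (all data) = {r_all:.2f}, r (90% of data) = {r_trimmed:.2f}")
```

Because trimming by prediction error removes exactly the points the model fits worst, the trimmed coefficient is always an optimistic companion to the full-data value and is best reported alongside it.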