Xiaolu Xu1, Hong Gu1, Yang Wang2, Jia Wang3, Pan Qin1.
Abstract
Anticancer drug responses vary among individual patients. This variation is mainly caused by genetic factors, such as mutations and RNA expression. Thus, these genetic features are often used to construct classification models that predict drug response. This research focuses on feature selection for such classification models. Because the feature space for predicting drug response is very high-dimensional, an autoencoder network was first built, and a subset of inputs with important contributions was selected. Then, using the Boruta algorithm, a further reduced set of features was determined for the random forest, which was used to predict drug response. Two datasets, GDSC and CCLE, were used to illustrate the efficiency of the proposed method.
Keywords: anticancer drug response; autoencoder; classification model; feature selection; random forest
Year: 2019 PMID: 30972101 PMCID: PMC6445890 DOI: 10.3389/fgene.2019.00233
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Flowchart of AutoBorutaRF for predicting anticancer drug response, which includes three parts: (A) data preprocessing, (B) feature selection, and (C) classifier construction.
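As a rough illustration of part (B), the sketch below shows the shadow-feature comparison that the Boruta algorithm is built on: each real feature is kept only if it repeatedly scores higher than the best of a set of shuffled copies. This is not the AutoBorutaRF implementation. The autoencoder stage is omitted, absolute Pearson correlation stands in for random forest importance, a fixed hit-rate threshold replaces Boruta's statistical test, and the names `abs_correlation`, `shadow_feature_hits`, `select_features`, and `min_hit_rate` are hypothetical.

```python
import random
import statistics


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return abs(sxy) / (sxx * syy) ** 0.5


def shadow_feature_hits(columns, labels, n_rounds=50, seed=0):
    """Count, per feature, the rounds in which it beats the best shadow feature.

    columns: list of feature columns (one list of numbers per feature)
    labels:  list of 0/1 drug-response labels (1 = sensitive), an assumed coding
    """
    rng = random.Random(seed)
    # Stand-in importance score (assumption): the source uses a random forest.
    real_scores = [abs_correlation(col, labels) for col in columns]
    hits = [0] * len(columns)
    for _ in range(n_rounds):
        # A shadow feature is a shuffled copy of a real column: same values,
        # but any relation to the labels is destroyed.
        best_shadow = 0.0
        for col in columns:
            shadow = list(col)
            rng.shuffle(shadow)
            best_shadow = max(best_shadow, abs_correlation(shadow, labels))
        for j, score in enumerate(real_scores):
            if score > best_shadow:
                hits[j] += 1
    return hits


def select_features(columns, labels, n_rounds=50, min_hit_rate=0.9, seed=0):
    """Return indices of features that beat the shadows in most rounds."""
    hits = shadow_feature_hits(columns, labels, n_rounds, seed)
    return [j for j, h in enumerate(hits) if h / n_rounds >= min_hit_rate]


# Toy data: one informative feature and two uninformative ones.
labels = [0] * 20 + [1] * 20
informative = [y + 0.01 * (i % 5) for i, y in enumerate(labels)]
constant = [1.0] * 40
alternating = [float(i % 2) for i in range(40)]
selected = select_features([informative, constant, alternating], labels)
print(selected)
```

In this toy setup only the first column tracks the labels, so it is the only one expected to survive. In the method described above, the surviving features would then be passed to the random forest classifier.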
Total numbers of samples for three features.
| GDSC | Raw | 139 | 1,124 | 11,833 (789) | 70 (778) | 24,960 (936) |
| | Preprocessed | 98 | 555 | 11,712 (555) | 54 (555) | 24,959 (555) |
| CCLE | Raw | 24 | 1,061 | 20,049 (1,028) | 1,667 (1,044) | 24,960 (742) |
| | Preprocessed | 24 | 363 | 19,389 (363) | 1,667 (363) | 24,960 (363) |
The number in parentheses is the total number of cell lines corresponding to the features.
Figure 2. Histograms of drug responses for 12 drugs in GDSC. The distributions of drug responses differed across drugs.
Mean values of six evaluation metrics obtained from GDSC.
| AutoBorutaRF | 0.6542 | |||||
| Naive Bayes | 0.6792 | 0.6109 | 0.4242 | 0.4947 | 0.2475 | |
| SVM-RFE | 0.5159 | 0.5945 | 0.5797 | 0.6092 | 0.5855 | 0.1915 |
| FSelector | 0.6477 | 0.6061 | 0.6171 | 0.5952 | 0.6068 | 0.2155 |
| AutoHidden | 0.6095 | 0.5780 | 0.5576 | 0.5984 | 0.5651 | 0.1584 |
The bold number indicates the best result.
Mean values of six evaluation metrics obtained from CCLE.
| AutoBorutaRF | 0.8137 | |||||
| Naive Bayes | 0.7793 | 0.6838 | 0.3325 | 0.9194 | 0.3662 | 0.2759 |
| SVM-RFE | 0.5516 | 0.7287 | 0.4286 | 0.8129 | 0.5239 | 0.2961 |
| FSelector | 0.7372 | 0.7430 | 0.5061 | 0.8058 | 0.5639 | 0.3535 |
| AutoHidden | 0.7063 | 0.6970 | 0.1338 | 0.3567 | 0.2198 |
The bold number indicates the best result.
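For reference, the six metrics reported for these comparisons (AUC, accuracy, recall, specificity, F1 score, and Matthews correlation coefficient) can be computed from binary labels, predicted labels, and prediction scores as in the minimal sketch below. It uses the standard textbook definitions. The function names, the 0/1 label coding (1 = sensitive), and the example data are assumptions made for illustration, and the numbers are not taken from the tables.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, recall, specificity, F1 and MCC from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    recall = tp / (tp + fn) if tp + fn else 0.0          # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0     # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


def auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 4 sensitive and 4 non-sensitive cell lines.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
metrics = classification_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)
print(metrics)
```

Recall and specificity pull in opposite directions when a classifier leans toward one class, which is why a method can score well on specificity while doing poorly on recall, F1, and MCC.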
Figure 3. Box plots of the six evaluation metrics over all the cell lines in the (A) GDSC and (B) CCLE datasets. Our method performed best with respect to AUC, accuracy, recall, specificity, F1 score, and Matthews correlation coefficient. The naive Bayes classifier and SVM-RFE performed better on specificity.
Figure 4. Prediction performance for the lung cell lines in GDSC. (A) Box plots of the six metrics over all the lung cell lines show satisfactory prediction performance. (B) The histogram of p-values obtained from the statistical significance test for the identified features shows that most of the identified features had significantly different genetic profiles between the sensitive and non-sensitive populations.
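The specific significance test behind panel (B) is not named here. As one possible illustration, the sketch below computes a two-sided permutation p-value for the difference in mean feature value between sensitive and non-sensitive cell lines. The choice of a permutation test, the function name `permutation_p_value`, and the example values are all assumptions, not the procedure used for the figure.

```python
import random
import statistics


def permutation_p_value(sensitive, non_sensitive, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    sensitive / non_sensitive: values of one feature in each population.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(sensitive) - statistics.fmean(non_sensitive))
    pooled = list(sensitive) + list(non_sensitive)
    k = len(sensitive)
    extreme = 0
    for _ in range(n_perm):
        # Reassign group membership at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:k]) - statistics.fmean(pooled[k:]))
        if diff >= observed:
            extreme += 1
    # The +1 terms count the observed split itself, so p is never exactly 0.
    return (extreme + 1) / (n_perm + 1)


# Made-up expression values for one feature in the two populations.
sensitive = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4, 5.2, 4.8]
non_sensitive = [3.0, 3.2, 2.9, 3.1, 3.3, 2.8, 3.0, 3.1]
p_value = permutation_p_value(sensitive, non_sensitive)
print(p_value)
```

Running such a test once per identified feature yields one p-value per feature, and a histogram of those values concentrated near zero is the pattern the caption describes.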
Figure 5. Performance metrics of AutoBorutaRF over all the lung cell lines in GDSC for PLX4720 and BIBW2992.