Victor Tkachev, Maxim Sorokin, Constantin Borisov, Andrew Garazha, Anton Buzdin, Nicolas Borisov
Abstract
(1) Background: Machine learning (ML) methods are rarely used for omics-based prescription of cancer drugs, due to a shortage of case histories in which clinical outcome is supplemented by high-throughput molecular data. This causes overtraining and high vulnerability of most ML methods. Recently, we proposed a hybrid global-local approach to ML termed floating window projective separator (FloWPS) that avoids extrapolation in the feature space. Its core property is data trimming, i.e., sample-specific removal of irrelevant features.
Keywords: bioinformatics; chemotherapy; machine learning; omics profiling; oncology; personalized medicine
Year: 2020 PMID: 31979006 PMCID: PMC7037338 DOI: 10.3390/ijms21030713
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Area under curve (AUC) (a–d), sensitivity (SN) (e–h) and specificity (SP) (i–l) calculated for treatment response classifiers for eleven non-equalized datasets. The classifiers were based on SVM (a,e,i), RF (b,f,j), binomial naïve Bayes (BNB) (c,g,k) and multi-layer perceptron (MLP) (d,h,l) machine learning (ML) methods. The color legend shows the absence or presence of FloWPS in the classifier analytic pipeline and the value of relative balance factor B. On each panel, each violin plot shows the distribution of values for eleven cancer datasets.
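
The three metrics named in the caption can be illustrated with a minimal sketch. It assumes a classifier that outputs a continuous score per patient (higher meaning more responder-like) and a fixed decision threshold. The function name `auc_sn_sp`, the toy scores and the threshold value are hypothetical, and the sketch does not model the relative balance factor B used in the paper.

```python
def auc_sn_sp(scores, labels, threshold):
    """Compute AUC, sensitivity and specificity for a binary classifier.

    scores: classifier outputs, higher = more responder-like (assumption).
    labels: 1 for responder, 0 for non-responder.
    threshold: hypothetical cut-off; scores >= threshold are called responders.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]

    # AUC as the probability that a random responder outscores a random
    # non-responder; ties count as one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))

    # Sensitivity: fraction of responders called responders.
    sn = sum(s >= threshold for s in pos) / len(pos)
    # Specificity: fraction of non-responders called non-responders.
    sp = sum(s < threshold for s in neg) / len(neg)
    return auc, sn, sp


# Toy data, not taken from the paper.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc, sn, sp = auc_sn_sp(scores, labels, threshold=0.5)
print(round(auc, 3), round(sn, 3), round(sp, 3))
```

AUC does not depend on the threshold, whereas SN and SP do, which is why the caption reports SN and SP for specific settings while AUC summarizes the whole score ranking.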
Performance metrics for seven ML methods with default settings for datasets with equal numbers of responders and non-responders.
| ML Method | Method Type | Median AUC without FloWPS | Median AUC with FloWPS | Paired | Advantage of FloWPS | Median SN at | Median SP at |
|---|---|---|---|---|---|---|---|
| SVM | Global | 0.74 | 0.80 | 1.3 × 10−5 | Yes | 0.45 | 0.42 |
| kNN | Local | 0.76 | 0.75 | 0.53 | No | 0.25 | 0.34 |
| RF | Global | 0.74 | 0.82 | 1.3 × 10−5 | Yes | 0.45 | 0.42 |
| RR | Local | 0.80 | 0.79 | 0.16 | No | 0.36 | 0.41 |
| BNB | Global | 0.77 | 0.82 | 2.7 × 10−4 | Yes | 0.51 | 0.58 |
| ADA | Global | 0.70 | 0.76 | 2.4 × 10−4 | Yes | 0.32 | 0.41 |
| MLP | Global | 0.73 | 0.82 | 6.4 × 10−5 | Yes | 0.53 | 0.53 |
Yes: FloWPS is beneficial for ML quality; No: FloWPS is not beneficial for ML quality.
Performance metrics for BNB, MLP and RF methods with the advanced settings for datasets with equal numbers of responder and non-responder samples.
| ML Method | Median AUC without FloWPS | Median AUC with FloWPS | Paired | Median SN at | Median SP at |
|---|---|---|---|---|---|
| RF | 0.75 | 0.83 | 3.5 × 10−6 | 0.50 | 0.56 |
| BNB | 0.78 | 0.83 | 6.7 × 10−4 | 0.50 | 0.60 |
| MLP | 0.77 | 0.84 | 2.4 × 10−4 | 0.50 | 0.51 |
Performance metrics for BNB, MLP, RF and SVM methods with the advanced settings for eleven datasets with variable numbers of responder and non-responder samples.
| Method | Median AUC without FloWPS | Median AUC with FloWPS | Paired | Median SN at | Median SP at |
|---|---|---|---|---|---|
| SVM | 0.81 | 0.83 | 0.013 | 0.65 | 0.70 |
| RF | 0.76 | 0.86 | 4.9 × 10−6 | 0.56 | 0.71 |
| BNB | 0.84 | 0.89 | 7.5 × 10−4 | 0.78 | 0.75 |
| MLP | 0.83 | 0.88 | 1.0 × 10−4 | 0.63 | 0.71 |
Median pairwise Pearson/Spearman correlation at feature (gene expression) importance (I) level. Figures above main diagonal: With FloWPS; figures below: Without FloWPS.
| | SVM | RF | RR | BNB | MLP |
|---|---|---|---|---|---|
| SVM | 1 | 0.53/0.55 | 0.40/0.39 | 0.37/0.34 | 0.46/0.46 |
| RF | 0.34/0.40 | 1 | 0.51/0.32 | 0.48/0.31 | 0.59/0.38 |
| RR | 0.19/0.14 | 0.35/0.04 | 1 | 0.93/0.79 | 0.89/0.52 |
| BNB | 0.24/0.14 | 0.33/0.09 | 0.88/0.64 | 1 | 0.81/0.46 |
| MLP | 0.33/0.30 | 0.40/0.17 | 0.76/0.06 | 0.61/0.12 | 1 |
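
Each cell in the table above pairs a Pearson value with a Spearman value computed between the feature importance vectors of two ML methods. A minimal sketch of that pairwise comparison for a single pair of methods follows. The vectors `imp_a` and `imp_b` are invented toy values, not importances from the paper, and the helper names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(v):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            r[order[t]] = mean_rank
        i = j + 1
    return r


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy importance vectors for the same five genes under two methods.
imp_a = [0.1, 0.2, 0.3, 0.4, 10.0]
imp_b = [1.0, 2.0, 3.0, 5.0, 4.0]
r_p = pearson(imp_a, imp_b)
r_s = spearman(imp_a, imp_b)
print(f"{r_p:.2f}/{r_s:.2f}")
```

In this toy case one gene with a very large importance pulls the Pearson value well below the Spearman value. Rank-based and value-based agreement can diverge in this way, which is one reason to report both.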
Figure 2. Schematic view of global-local order hybrid ML analytic pipeline (adapted from [8]; copyright belongs to the authors of [8], who also wrote the current paper). (a) Global machine learning methods may fail to separate classes for datasets without global order. (b) ML, coupled with FloWPS, works locally and handles such cases more accurately.
Minimal, median, mean and maximal Pearson/Spearman correlation values for pairwise comparison of different ML methods with FloWPS at the level of feature importance (I).
| Dataset # | Dataset ID | Min | Median | Mean | Max |
|---|---|---|---|---|---|
| 1 | GSE25066 | 0.41/0.28 | 0.72/0.44 | 0.67/0.46 | 0.93/0.81 |
| 2 | GSE41998 | −0.02/−0.10 | 0.55/0.39 | 0.49/0.35 | 0.87/0.83 |
| 3 | GSE9782 | 0.37/0.19 | 0.58/0.41 | 0.62/0.41 | 0.97/0.88 |
| 4 | GSE39754 | 0.34/0.28 | 0.50/0.37 | 0.54/0.41 | 0.84/0.72 |
| 5 | GSE68871 | 0.50/0.43 | 0.62/0.60 | 0.68/0.64 | 0.95/0.93 |
| 6 | GSE55145 | 0.32/0.29 | 0.57/0.42 | 0.60/0.45 | 0.85/0.70 |
| 7 | TARGET-50 | 0.34/0.57 | 0.69/0.74 | 0.66/0.72 | 0.95/0.82 |
| 8 | TARGET-10 | 0.32/0.30 | 0.50/0.45 | 0.58/0.48 | 0.90/0.77 |
| 9 | TARGET-20 busulfan | 0.63/0.55 | 0.70/0.66 | 0.76/0.70 | 0.97/0.89 |
| 10 | TARGET-20 no busulfan | 0.16/0.35 | 0.63/0.53 | 0.60/0.55 | 0.92/0.79 |
| 11 | GSE18728 | 0.38/0.21 | 0.54/0.46 | 0.62/0.45 | 0.95/0.79 |
| 12 | GSE20181 | 0.33/0.17 | 0.43/0.43 | 0.56/0.43 | 0.96/0.79 |
| 13 | GSE20194 | 0.06/0.04 | 0.50/0.30 | 0.49/0.34 | 0.93/0.80 |
| 14 | GSE23988 | 0.28/0.18 | 0.46/0.35 | 0.55/0.39 | 0.96/0.82 |
| 15 | GSE32646 | 0.23/0.11 | 0.37/0.28 | 0.49/0.32 | 0.95/0.74 |
| 16 | GSE37946 | 0.40/0.26 | 0.62/0.45 | 0.62/0.44 | 0.92/0.69 |
| 17 | GSE42822 | 0.34/0.03 | 0.52/0.40 | 0.58/0.38 | 0.89/0.82 |
| 18 | GSE5122 | 0.12/−0.06 | 0.40/0.20 | 0.46/0.25 | 0.93/0.79 |
| 19 | GSE59515 | 0.37/0.26 | 0.47/0.47 | 0.59/0.49 | 0.96/0.74 |
| 20 | TCGA-LGG | 0.27/0.13 | 0.64/0.47 | 0.63/0.42 | 0.94/0.76 |
| 21 | TCGA-LC | 0.44/0.23 | 0.62/0.55 | 0.66/0.53 | 0.95/0.90 |
Clinically annotated gene expression datasets used in this study.
| Reference | Dataset ID | Disease Type | Treatment | Experimental Platform | Number N_C of Cases (R vs. NR) | Number S of Core Marker Genes |
|---|---|---|---|---|---|---|
| [ | GSE25066 | Breast cancer with different hormonal and HER2 status | Neoadjuvant taxane + anthracycline | Affymetrix Human Genome U133 Array | 235 (118 R: Complete response + partial response; 117 NR: Residual disease + progressive disease) | 20 |
| [ | GSE41998 | Breast cancer with different hormonal and HER2 status | Neoadjuvant doxorubicin + cyclophosphamide, followed by paclitaxel | Affymetrix Human Genome U133 Array | 68 (34 R: Complete response + partial response; 34 NR: Residual disease + progressive disease) | 11 |
| [ | GSE9782 | Multiple myeloma | Bortezomib monotherapy | Affymetrix Human Genome U133 Array | 169 (85 R: Complete response + partial response; 84 NR: No change + progressive disease) | 18 |
| [ | GSE39754 | Multiple myeloma | Vincristine + adriamycin + dexamethasone followed by autologous stem cell transplantation (ASCT) | Affymetrix Human Exon 1.0 ST Array | 124 (62 R: Complete, near-complete and very good partial responders, 62 NR: Partial, minor and worse) | 16 |
| [ | GSE68871 | Multiple myeloma | Bortezomib-thalidomide-dexamethasone | Affymetrix Human Genome U133 Plus | 98 (49 R: Complete, near-complete and very good partial responders, 49 NR: Partial, minor and worse) | 12 |
| [ | GSE55145 | Multiple myeloma | Bortezomib followed by ASCT | Affymetrix Human Exon 1.0 ST Array | 56 (28 R: Complete, near-complete and very good partial responders, 28 NR: Partial, minor and worse) | 14 |
| [ | TARGET-50 | Pediatric kidney Wilms tumor | Vincristine sulfate + cyclosporine, cytarabine, daunorubicin + conventional surgery + radiation therapy | Illumina HiSeq 2000 | 72 (36 R, 36 NR: See Reference [ | 14 |
| [ | TARGET-10 | Pediatric acute lymphoblastic leukemia | Vincristine sulfate + carboplatin, cyclophosphamide, doxorubicin | Illumina HiSeq 2000 | 60 (30 R, 30 NR: See Reference [ | 14 |
| [ | TARGET-20 | Pediatric acute myeloid leukemia | Non-target drugs (asparaginase, cyclosporine, cytarabine, daunorubicin, etoposide; methotrexate, mitoxantrone), including busulfan and cyclophosphamide | Illumina HiSeq 2000 | 46 (23 R, 23 NR: See Reference [ | 10 |
| [ | TARGET-20 | Pediatric acute myeloid leukemia | Same non-target drugs, but excluding busulfan and cyclophosphamide | Illumina HiSeq 2000 | 124 (62 R, 62 NR: See Reference [ | 16 |
| [ | GSE18728 | Breast cancer | Docetaxel, capecitabine | Affymetrix Human Genome U133 Plus 2.0 Array | 61 (23 R: Complete response + partial response; 38 NR: Residual disease + progressive disease) | 16 |
| [ | GSE20181 | Breast cancer | Letrozole | Affymetrix Human Genome U133A Array | 52 (37 R: Complete response + partial response; 15 NR: Residual disease + progressive disease) | 11 |
| [ | GSE20194 | Breast cancer | Paclitaxel; (tri)fluoroacetyl chloride; 5-fluorouracil, epirubicin, cyclophosphamide | Affymetrix Human Genome U133A Array | 52 (11 R: Complete response + partial response; 41 NR: Residual disease + progressive disease) | 10 |
| [ | GSE23988 | Breast cancer | Docetaxel, capecitabine | Affymetrix Human Genome U133A Array | 61 (20 R: Complete response + partial response; 41 NR: Residual disease + progressive disease) | 18 |
| [ | GSE32646 | Breast cancer | Paclitaxel, 5-fluorouracil, epirubicin, cyclophosphamide | Affymetrix Human Genome U133 Plus 2.0 Array | 115 (27 R: Complete response + partial response; 88 NR: Residual disease + progressive disease) | 17 |
| [ | GSE37946 | Breast cancer | Trastuzumab | Affymetrix Human Genome U133A Array | 50 (27 R: Complete response + partial response; 23 NR: Residual disease + progressive disease) | 14 |
| [ | GSE42822 | Breast cancer | Docetaxel, 5-fluorouracil, epirubicin, cyclophosphamide, capecitabine | Affymetrix Human Genome U133A Array | 91 (38 R: Complete response + partial response; 53 NR: Residual disease + progressive disease) | 13 |
| [ | GSE5122 | Acute myeloid leukemia | Tipifarnib | Affymetrix Human Genome U133A Array | 57 (13 R: Complete response + partial response + stable disease; 44 NR: Progressive disease) | 10 |
| [ | GSE59515 | Breast cancer | Letrozole | Illumina HumanHT-12 V4.0 expression beadchip | 75 (51 R: Complete response + partial response; 24 NR: Residual disease + progressive disease) | 15 |
| [ | TCGA-LGG | Low-grade glioma | Temozolomide + (optionally) mibefradil | Illumina HiSeq 2000 | 131 (100 R: Complete response + partial response + stable disease; 31 NR: Progressive disease) | 9 |
| [ | TCGA-LC | Lung cancer | Paclitaxel + (optionally) cisplatin/carboplatin, reolysin | Illumina HiSeq 2000 | 41 (24 R: Complete response + partial response + stable disease; 17 NR: Progressive disease) | 7 |
Figure 3. Outline of floating window projective separator (FloWPS) approach. Selection of relevant features (a) and nearest neighbors (b) are schematized.
Figure 4. The algorithm of data trimming used for leave-one-out (LOO) cross-validation of the clinically annotated gene expression datasets. Indexes i and j denote samples (patients), index s denotes pairs of (m0,k0)-values in the prediction-accountable set, and indexes m and k denote the data trimming parameters.
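
The data trimming idea in the two captions above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. It assumes that a feature counts as relevant for a given sample when at least m training samples lie on each side of that sample's value (so no extrapolation is needed along that axis), and that only the k nearest training samples in the retained feature subspace are kept. The majority vote at the end is a stand-in for whichever ML method is actually coupled with FloWPS, and the selection of the prediction-accountable set of (m0,k0) pairs is not modeled. All function names and toy data are hypothetical.

```python
from math import sqrt


def trim(train_X, x, m, k):
    """Sample-specific trimming for test sample x.

    Returns (kept feature indices, kept training-sample indices).
    Assumed rule: keep feature f if at least m training values lie
    strictly below x[f] and at least m lie strictly above it.
    """
    feats = [f for f in range(len(x))
             if sum(row[f] < x[f] for row in train_X) >= m
             and sum(row[f] > x[f] for row in train_X) >= m]
    if not feats:
        return [], []

    def dist(row):
        # Euclidean distance restricted to the retained features.
        return sqrt(sum((row[f] - x[f]) ** 2 for f in feats))

    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i]))
    return feats, order[:k]


def loo_predict(X, y, m, k):
    """Leave-one-out loop: each sample is predicted from the others."""
    preds = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        feats, nbrs = trim(train_X, X[i], m, k)
        if not nbrs:
            preds.append(None)  # no relevant feature left for this sample
            continue
        # Stand-in local classifier: majority vote among kept neighbors.
        votes = sum(train_y[j] for j in nbrs)
        preds.append(1 if 2 * votes > len(nbrs) else 0)
    return preds


# Toy data (invented): 6 patients, 2 features, labels 1 = responder.
X = [[1.0, 5.0], [1.2, 5.5], [1.1, 4.8], [3.0, 5.2], [3.2, 5.1], [3.1, 4.9]]
y = [0, 0, 0, 1, 1, 1]

# A new sample whose second feature lies outside the training range.
feats, nbrs = trim(X, [2.0, 9.0], m=1, k=2)
print(feats, nbrs)

preds = loo_predict(X, y, m=1, k=3)
print(preds)
```

For the new sample, the second feature is dropped because no training value exceeds 9.0, so the neighbors are chosen on the first feature alone. In a fuller pipeline the LOO predictions would be scored for each (m, k) pair in order to choose the trimming parameters.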