Melissa Zhao1, Yushi Tang1, Hyunkyung Kim1, Kohei Hasegawa1,2.
Abstract
OBJECTIVE: Despite existing prognostic markers, breast cancer prognosis remains difficult because of the complex relationships between the many contributing factors and survival. This study integrates multiple clinicopathological and genomic factors, applies dimension reduction, and compares survival predictions across machine learning algorithms.
Keywords: Breast cancer; machine learning methods; prediction; survival
Year: 2018 PMID: 30455569 PMCID: PMC6238199 DOI: 10.1177/1176935118810215
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. Study scheme. KNN indicates K-nearest-neighbor; METABRIC, Molecular Taxonomy of Breast Cancer International Consortium; ROC, receiver operating characteristic; SVM, support vector machine.
Summary of patient characteristics.
| Patient characteristics | Overall (n = 1874) |
|---|---|
| Age (year), median (IQR) | 62 (51-71) |
| NPI, median | 4 |
| Menopause | 78.3% |
| ER positive status | 76.6% |
| PR positive status | 53.0% |
| HER2-positive status | 12.4% |
| Three genes status, mode (%) | 3 (32.3%) |
| Claudin subtypes, mode (%) | 3 (35.7%) |
| Chemotherapy | 20.8% |
| Hormonal therapy | 61.7% |
| Radiotherapy | 60.3% |
| Surgery, mode (%) | 1 (59.1%) |
| Tumor size (mm), median (IQR) | 51.0 (30.3-63.0) |
| Tumor grade, median | 2 |
| Tumor stage, median | 3 |
| Laterality, mode (%) | −1 (49.3%) |
| Cellularity, median | 2 |
| Oncotree code, mode (%) | 4 (79.1%) |
| Outcomes | |
| 5-year survival | 75.2% |
| 10-year survival | 47.7% |
| 15-year survival | 26.4% |
Table includes all clinicopathological features used in machine learning models.
Abbreviations: ER, estrogen receptor; HER2, human epidermal growth factor receptor 2; IQR, interquartile range; METABRIC, Molecular Taxonomy of Breast Cancer International Consortium; NPI, Nottingham Prognostic Index; PR, progesterone receptor.
Three genes status: ER, HER2, and Aurora kinase A (AURKA) activity.
Claudin (PAM50) subtypes: luminal A, luminal B, HER2-enriched, basal-like, and Claudin-low.
Oncotree code: tumor types based on Oncotree reported in METABRIC dataset.
Figure 2. (A) Correlation plot of K-means clusters for one random run with INTCLUST5 and 5-year survival. (B) Heatmap of K-means clustering of the training set and KNN classification of the validation set over 10 random runs. Colors indicate group number across runs. Cluster groups are stable, particularly for the group with the worst survival (Group 1 in Run 1): most patients classified into a given group are clustered into the same group across repeated runs.
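The two-step grouping described in the Figure 2 caption (K-means on the training set, then KNN to place validation patients into those clusters) can be illustrated with a minimal, standard-library sketch. This is not the authors' pipeline: the function names, the Euclidean distance, the number of clusters, and `n_neighbors=3` are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def dist(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd-style K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random points as initial centers
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def knn_predict(train_points, train_labels, query, n_neighbors=3):
    """Assign a validation sample to the majority cluster of its nearest training samples."""
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:n_neighbors])
    return votes.most_common(1)[0][0]
```

Fitting clusters on the training set only and assigning validation samples afterwards keeps validation data out of the clustering step, which is presumably the reason for the two-stage design.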
Summary of model performances in terms of discrimination ability and accuracy for one run.
| Models | ROC (95% CI) | Accuracy (95% CI) |
|---|---|---|
| Gradient boosting | 0.669 (0.608, 0.730) | 0.697 (0.648, 0.743) |
| Random forest | 0.677 (0.617, 0.736) | 0.729 (0.681, 0.773) |
| SVM | 0.658 (0.596, 0.720) | 0.729 (0.681, 0.773) |
| ANN | 0.673 (0.611, 0.735) | 0.721 (0.672, 0.765) |
All models performed similarly across ROC and accuracy measures. See Supplementary Table 1 for performance across all runs.
Abbreviations: ANN, artificial neural network; CI, confidence interval; ROC, receiver operating characteristic; SVM, support vector machine.
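For readers who want to reproduce metrics of the kind reported in the table above, the following is a minimal sketch of a rank-based area under the ROC curve and an accuracy estimate with a 95% interval. The Wilson score interval is used here as an assumption; the source does not state which interval method the authors used, and the function names are hypothetical.

```python
import math


def auc(scores, labels):
    """Area under the ROC curve as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (assumed method, not the paper's)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


def accuracy(predicted, actual):
    """Fraction of predictions that match the observed class."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a validation set of a few hundred patients but would need a sort-based formulation for much larger data.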
Figure 3. (A) Area under ROC curve of all prediction models based on clinicopathological features and genomic clusters from gene expression data from one run. (B) Calibration slopes (CSs) of all models from one run. See Supplementary Figure 1 for CS graphs for the other nine runs. (C) ROC curve (with 95% CI in lighter colors) and (D) accuracy (with 95% CI in lighter colors) of all models for 10 random training/validation splits. All models performed similarly in terms of ROC and accuracy. Performance measures were stable over 10 random runs, with all methods predicting 5-year survival better than random. ROC indicates receiver operating characteristic.
Figure 4. Sum of variable importance values for all variables across 10 random runs, by model: (A) gradient boosting, (B) random forest, (C) ANN, and (D) SVM. All models except ANN consistently chose NPI as the most important variable. Other important variables include tumor size and stage, ER/PR/HER2 status, and breast surgery status. The K-means cluster with the worst survival was moderately important across all models except ANN. ANN was the most unstable model in terms of the variable importance values assigned to each variable across runs. The x-axis denotes the sum of variable importance values across 10 random runs and cannot exceed 1000, which is the sum for a variable that was the most important in every run of a model.
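The upper bound of 1000 in the Figure 4 caption implies that importance is scaled within each run so that the top variable scores 100 before the runs are summed. A minimal sketch of that aggregation follows; the per-run scaling rule and the function names are assumptions inferred from the stated bound, not the authors' code.

```python
def scale_to_100(raw):
    """Rescale one run's raw importance so the top variable scores 100 (assumed rule)."""
    top = max(raw.values())
    return {name: 100.0 * value / top for name, value in raw.items()}


def summed_importance(runs):
    """Sum scaled importance across runs; the maximum possible total is 100 * len(runs)."""
    totals = {}
    for raw in runs:
        for name, value in scale_to_100(raw).items():
            totals[name] = totals.get(name, 0.0) + value
    # Sort from most to least important for plotting or reporting.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

Under this scheme, a variable that tops every one of 10 runs sums to exactly 1000, while a variable whose rank fluctuates between runs (as described for ANN) accumulates a lower and less predictable total.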
Figure 5. Boxplot of overall survival distribution by genomic clusters for one run (A) and survival curves of the same genomic clusters derived from gene expression (B). There was a significant difference in survival between the clusters (P < .001). (C) Summary of the top 11 differentially expressed genes, top gene ontology terms, and top Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways found from differentially expressed genes. (D) Network hubs of the top 100 differentially expressed genes, highlighting the ERB2 hub (left) and the CLCA2 hub (right). Red dots indicate ERB2 (left) and CLCA2 (right); grey dots indicate genes connected to the red dots via functional pathways.