Georgios Aivaliotis, Jan Palczewski, Rebecca Atkinson, Janet E Cade, Michelle A Morris.
Abstract
Survival analysis with cohort study data has traditionally been performed using Cox proportional hazards models. Random survival forests (RSFs), a machine learning method, now present an alternative. Using the UK Women's Cohort Study (n = 34,493) we evaluate two methods, a Cox model and an RSF, to investigate the association between Body Mass Index and time to breast cancer incidence. Robustness of the models was assessed by cross validation and bootstrapping. Histograms of bootstrap coefficients are reported. C-Indices and Integrated Brier Scores are reported for all models. In post-menopausal women, the Cox model Hazard Ratios (HR) for Overweight (OW) and Obese (O) were 1.25 (1.04, 1.51) and 1.28 (0.98, 1.68) respectively, and the RSF Odds Ratios (OR) with partial dependence on menopause for OW and O were 1.34 (1.31, 1.70) and 1.45 (1.42, 1.48). The HRs are non-significant results. Only the RSF appears confident about the effect of weight status on time to event. Bootstrapping demonstrated that Cox model coefficients can vary significantly, weakening interpretation potential. An RSF was used to produce partial dependence plots (PDPs) showing that OW and O weight status increase the probability of breast cancer incidence in post-menopausal women. All models have a relatively low C-Index and a high Integrated Brier Score. The RSF overfits the data. In our study, the RSF can identify complex non-proportional hazard type patterns in the data and allows more complicated relationships to be investigated using PDPs, but it overfits, limiting extrapolation of results to new instances. Moreover, it is less easily interpreted than Cox models. The value of survival analysis remains paramount, and therefore machine learning techniques like RSF should be considered as another method for analysis.
Year: 2021 PMID: 34234154 PMCID: PMC8263588 DOI: 10.1038/s41598-021-92944-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Histograms of coefficients of variable O (Obese) for Cox models trained on 100 bootstrap samples, illustrating that the distribution of the coefficients agrees with the reported confidence intervals for HRs.
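The link between a Cox coefficient and its reported HR can be worked through by hand: the coefficient is the natural log of the HR, and a Wald-type 95% CI is symmetric on the log scale. The sketch below assumes the reported intervals are Wald intervals (the source does not state how they were computed) and backs out the coefficient and standard error implied by the HR for O in post-menopausal women, 1.28 (0.98, 1.68). The function name `coef_from_hr` is illustrative only.

```python
import math

def coef_from_hr(hr, lower, upper, z=1.96):
    """Recover a log-scale coefficient and its standard error from an HR and CI.

    Assumption: the CI is a Wald interval, exp(beta - z*se) to exp(beta + z*se).
    """
    beta = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se

# Reported HR for Obese, post-menopausal women: 1.28 (0.98, 1.68)
beta, se = coef_from_hr(1.28, 0.98, 1.68)
print(f"beta = {beta:.3f}, se = {se:.3f}")

# Rebuild the interval from beta and se as a consistency check
ci_low = math.exp(beta - 1.96 * se)
ci_high = math.exp(beta + 1.96 * se)
print(f"rebuilt CI = ({ci_low:.2f}, {ci_high:.2f})")
```

Under that assumption, a histogram of bootstrap coefficients that agrees with the reported interval would be centred near `beta` with a spread close to `se`.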
Summary of Cox and RSF models reporting HRs and ORs for breast cancer incidence by weight status, when compared to normal weight status.
| Weight status | Cox HR (95% CI) | Cox C-Index/IBS | RSF OR (95% CI) | RSF C-Index/IBS |
|---|---|---|---|---|
| UW | 0.84 (0.51, 1.36) | 0.57/0.025 | 1.10 (1.08, 1.12) | 0.53/0.021 |
| OW | 1.13 (0.98, 2.30) | | 1.24 (1.21, 1.27) | |
| O | 1.16 (0.94, 1.42) | | 1.36 (1.33, 1.39) | |
| UW | 0.55 (0.23, 1.34) | 0.56/0.028 | 1.11 (1.09, 1.13) | |
| OW | 1.25 (1.04, 1.51) | | 1.34 (1.31, 1.70) | |
| O | 1.28 (0.98, 1.68) | | 1.45 (1.42, 1.48) | |
| UW | 0.50 (0.21, 1.22) | 0.57/0.025 | | |
| OW | 1.22 (1.02, 1.47) | | | |
| O | 1.24 (0.96, 1.60) | | | |
Figure 2. (A) Partial dependence plot of the relationship between menopausal status and breast cancer incidence over age. Note that the curves cross, an effect that cannot be modelled directly under the proportional hazards assumption; (B) Histogram of age by menopause status; (C) Partial dependence plot of the relationship between weight status and breast cancer incidence over age, illustrating that OW and O have increased probability of incidence for all ages; (D) Histogram of age by weight status.
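As a rough illustration of how a partial dependence curve is built, the sketch below forces one feature to each candidate value for every row and averages the model's predictions. The model `toy_predict`, its coefficients, and the three example rows are hypothetical stand-ins, not the fitted RSF or the cohort data from the study.

```python
def partial_dependence(predict, rows, feature, values):
    """Average prediction when `feature` is set to each value for all rows."""
    curve = {}
    for v in values:
        # Overwrite the feature in a copy of every row, leaving other fields intact
        preds = [predict({**row, feature: v}) for row in rows]
        curve[v] = sum(preds) / len(preds)
    return curve

def toy_predict(row):
    """Hypothetical risk model used only to exercise the function."""
    risk = 0.01 + 0.0005 * (row["age"] - 40)
    if row["weight_status"] in ("OW", "O"):
        risk += 0.005
    return risk

rows = [
    {"age": 45, "weight_status": "N"},
    {"age": 60, "weight_status": "OW"},
    {"age": 70, "weight_status": "O"},
]
pdp = partial_dependence(toy_predict, rows, "weight_status", ["N", "OW", "O"])
for status, value in pdp.items():
    print(status, round(value, 4))
```

Because the averaging is done over the observed rows, the resulting curve reflects the marginal effect of the chosen feature under the model, which is why it can show patterns such as crossing curves that a single proportional hazards coefficient cannot.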
Figure 3. Boxplot showing the C-indices for three Cox proportional hazards models (all data with an interaction term, post-menopausal women only, the full sample) and the random survival forest.
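For readers unfamiliar with the metric, a minimal sketch of Harrell's C-index follows. It counts, among comparable pairs, how often the subject with the earlier observed event was also assigned the higher predicted risk; 0.5 corresponds to random ordering and 1.0 to perfect ordering. The data here are made up for illustration, and the study's own implementation may differ (for example, in how ties are handled).

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs ordered correctly by predicted risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i has an observed event strictly before j's time
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Made-up example: follow-up times, event indicators (1 = event, 0 = censored), risks
times = [2, 4, 6, 8]
events = [1, 0, 1, 1]
risks = [0.9, 0.3, 0.1, 0.5]
c_index = concordance_index(times, events, risks)
print(c_index)
```

Read against this scale, the reported C-indices of 0.53 to 0.57 sit only slightly above random ordering, consistent with the abstract's description of them as relatively low.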