Mohanad Mohammed1,2, Innocent B Mboya1,3, Henry Mwambi1, Murtada K Elbashir4, Bernard Omolo1,5,6.
Abstract
Understanding and identifying the markers and clinical information associated with colorectal cancer (CRC) patient survival is needed for early detection and diagnosis. In this work, we aimed to build a simple model using Cox proportional hazards (PH) and random survival forests (RSF) and to find a robust signature for predicting CRC overall survival. We used stepwise regression to develop a Cox PH model to analyse 54 common differentially expressed genes from three mutations. RSF was applied using the log-rank and log-rank-score split rules with 5000 survival trees, and variable importance was then used to identify the genes most influential for CRC survival. We compared the predictive performance of the Cox PH model and RSF for early CRC detection and diagnosis. The results indicate that the SLC9A8, IER5, ARSJ, ANKRD27, and PIPOX genes were significantly associated with CRC overall survival. In addition, age, sex, and disease stage also affected CRC overall survival. The RSF model using the log-rank rule performed better than the one using log-rank-score, and log-rank-score needed more trees to stabilize. Overall, the imputation of missing values enhanced the model's predictive performance. In addition, Cox PH predictive performance was better than RSF.
Year: 2021 PMID: 34965262 PMCID: PMC8716055 DOI: 10.1371/journal.pone.0261625
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Flow-chart of the procedure followed in the pre-processing and analysis of the dataset.
Clinical characteristics of colorectal cancer patients (N = 307).
| Variable | Frequency (n) | Percentage (%) |
|---|---|---|
| Age at diagnosis (years), mean (SD*) | 66.8 (13.2) | |
| KRAS mutation | | |
| Mutant | 123 | 40 |
| WildType | 184 | 60 |
| BRAF mutation | | |
| Mutant | 25 | 8 |
| WildType | 282 | 92 |
| TP53 mutation | | |
| Mutant | 166 | 54 |
| WildType | 141 | 46 |
| Tumor location | | |
| Proximal | 124 | 40 |
| Distal | 183 | 60 |
| Disease stage | | |
| Early | 156 | 51 |
| Late | 151 | 49 |
| Sex | | |
| Female | 137 | 45 |
| Male | 170 | 55 |
| Molecular subtype | | |
| C1 | 65 | 21 |
| C2 | 49 | 16 |
| C3 | 43 | 14 |
| C4 | 29 | 9 |
| C5 | 29 | 9 |
| C6 | 36 | 12 |
*SD: Standard deviation
Fig 2. Proportion and patterns of missing values in the clinical characteristics available in the GSE39582 dataset.
Testing the proportional hazard assumption using scaled Schoenfeld residuals.
| Probeset ID (Symbol) / Variable | χ² (df)* | p-value |
|---|---|---|
| 204014_at (DUSP4) | 10.219 (1) | 0.0014 |
| 212947_at (SLC9A8) | 1.345 (1) | 0.2462 |
| 218611_at (IER5) | 2.045 (1) | 0.1527 |
| 219973_at (ARSJ) | 3.601 (1) | 0.0577 |
| 221522_at (ANKRD27) | 1.583 (1) | 0.2083 |
| 221605_s_at (PIPOX) | 1.651 (1) | 0.1988 |
| 227134_at (SYTL1) | 4.699 (1) | 0.0302 |
| Age at diagnosis (years) | 2.589 (1) | 0.1076 |
| Molecular subtype | 15.824 (5) | 0.0074 |
| Disease stages | 1.173 (1) | 0.2787 |
| Sex | 0.378 (1) | 0.5388 |
| Tumor location | 0.951 (1) | 0.3294 |
*Chi-square statistic, with degrees of freedom in parentheses
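The p-values in this table follow directly from each chi-square statistic and its degrees of freedom. As a minimal sketch, the upper-tail probability can be reproduced with the Python standard library. The helper name `chi2_sf_odd_df` is hypothetical, and the closed form used here only covers odd degrees of freedom, which is all this table contains (1 and 5); it is not the software the authors used.

```python
import math


def chi2_sf_odd_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with odd df.

    Hypothetical helper for illustration; only odd df are supported.
    """
    if df < 1 or df % 2 == 0:
        raise ValueError("this sketch only handles odd degrees of freedom")
    # Base term (df = 1): P(X > x) = erfc(sqrt(x / 2)).
    tail = math.erfc(math.sqrt(x / 2.0))
    # Each further pair of degrees of freedom adds x^(j - 1/2) / (2j - 1)!!.
    series = 0.0
    double_factorial = 1.0
    for j in range(1, (df - 1) // 2 + 1):
        double_factorial *= 2 * j - 1
        series += x ** (j - 0.5) / double_factorial
    return tail + math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * series


# Statistics taken from the table above.
p_dusp4 = chi2_sf_odd_df(10.219, 1)    # table reports 0.0014
p_sytl1 = chi2_sf_odd_df(4.699, 1)     # table reports 0.0302
p_subtype = chi2_sf_odd_df(15.824, 5)  # table reports 0.0074

for name, p in [("DUSP4", p_dusp4), ("SYTL1", p_sytl1), ("Molecular subtype", p_subtype)]:
    print(f"{name}: p = {p:.4f}")
```

Rows with p below 0.05 (DUSP4, SYTL1, and molecular subtype) are the ones for which the test gives evidence against proportional hazards.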
Multivariable Cox PH results for predictors of colorectal cancer survival among adults aged 24 years and above.
| Probeset ID (Symbol) / Variables | HR* (SE), before imputation (N = 307) | 95% CI, before imputation | P-value, before imputation | HR* (SE), after imputation (N = 566) | 95% CI, after imputation | P-value, after imputation |
|---|---|---|---|---|---|---|
| 212947_at (SLC9A8) | 0.09 (0.84) | (0.02, 0.49) | 0.005** | 0.30 (0.66) | (0.08, 1.07) | 0.066 |
| 218611_at (IER5) | 9.51 (1.02) | (1.30, 69.58) | 0.027* | 6.48 (0.79) | (1.37, 30.53) | 0.019* |
| 219973_at (ARSJ) | 0.23 (0.48) | (0.09, 0.58) | 0.002** | 0.44 (0.36) | (0.22, 0.89) | 0.024* |
| 221522_at (ANKRD27) | 34.89 (1.48) | (1.91, 635.90) | 0.016* | 2.49 (1.06) | (0.31, 19.95) | 0.393 |
| 221605_s_at (PIPOX) | 0.43 (0.34) | (0.22, 0.85) | 0.014* | 0.49 (0.27) | (0.28, 0.83) | 0.009** |
| Age at diagnosis (years) | 1.03 (0.01) | (1.01, 1.05) | 0.001*** | 1.03 (0.01) | (1.01, 1.04) | <0.001*** |
| Sex | | | | | | |
| Female | 1.00 | | | 1.00 | | |
| Male | 1.23 (0.20) | (0.84, 1.81) | 0.281 | 1.40 (0.15) | (1.05, 1.88) | 0.024 |
| Disease stage | | | | | | |
| Early | 1.00 | | | 1.00 | | |
| Late | 1.97 (0.20) | (1.33, 2.93) | 0.001*** | 1.96 (0.15) | (1.47, 2.63) | <0.001*** |
| Tumor location | | | | | | |
| Proximal | 1.00 | | | 1.00 | | |
| Distal | 1.06 (0.21) | (0.71, 1.58) | 0.783 | 0.86 (0.16) | (0.63, 1.18) | 0.356 |
HR: Hazard ratio, SE: Standard error, adjusted for 212947_at, 218611_at, 219973_at, 221522_at, 221605_s_at, age at first diagnosis, sex, disease stage, and tumor location.
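The confidence intervals in this table are consistent with Wald intervals built on the log-hazard scale. The sketch below assumes that the reported SE is the standard error of the log hazard ratio (the Cox coefficient) and that a normal quantile of 1.96 was used; both are assumptions, since the source does not state them, and `wald_ci` is a hypothetical helper name. Because the table rounds HR and SE to two decimals, the recomputed limits only match approximately.

```python
import math


def wald_ci(hr: float, se_log_hr: float, z: float = 1.96) -> tuple[float, float]:
    """95% Wald interval for a hazard ratio, computed on the log scale.

    Assumes se_log_hr is the standard error of log(HR).
    """
    log_hr = math.log(hr)
    return math.exp(log_hr - z * se_log_hr), math.exp(log_hr + z * se_log_hr)


# Late versus early disease stage, before imputation: HR 1.97, SE 0.20.
lower, upper = wald_ci(1.97, 0.20)
print(f"Late stage: ({lower:.2f}, {upper:.2f})")  # table reports (1.33, 2.93)

# Age at diagnosis, before imputation: HR 1.03, SE 0.01.
age_lower, age_upper = wald_ci(1.03, 0.01)
print(f"Age: ({age_lower:.2f}, {age_upper:.2f})")  # table reports (1.01, 1.05)
```

An interval that excludes 1 corresponds to a p-value below 0.05, which is why the wide intervals for IER5 and ANKRD27 before imputation still count as significant.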
Random survival forests results before and after imputation using log-rank and log-rank-score split rules.
| Parameter | Before imputation (N = 246), log-rank | Before imputation (N = 246), log-rank-score | After imputation (N = 453), log-rank | After imputation (N = 453), log-rank-score |
|---|---|---|---|---|
| Number of deaths | 88 | 88 | 157 | 157 |
| Number of trees | 5000 | 5000 | 5000 | 5000 |
| Forest terminal node size | 15 | 15 | 15 | 15 |
| Average no. of terminal nodes | 13.58 | 11.92 | 25.34 | 22.14 |
| No. of variables tried at each split | 8 | 8 | 8 | 8 |
| Total no. of variables | 62 | 62 | 62 | 62 |
| Resampling used to grow trees | swor | swor | swor | swor |
| Resample size used to grow trees | 155 | 155 | 286 | 286 |
| Analysis | RSF | RSF | RSF | RSF |
| Family | surv | surv | surv | surv |
| Splitting rule | log-rank | log-rank-score | log-rank | log-rank-score |
| Number of random split points | 10 | 10 | 10 | 10 |
| Error rate | 41.26% | 49.05% | 33.22% | 43.01% |
* Analysis performed using the 80% training set
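Several entries in this table can be cross-checked by simple arithmetic. The sketch below assumes an 80/20 train/test split rounded to the nearest patient, a default of ceil(sqrt(p)) variables tried per split, and a subsample of 0.632 × n for sampling without replacement (swor). These rules are common random survival forest defaults, but the source does not confirm that they were used, and the function names are hypothetical.

```python
import math


def training_size(n_patients: int, fraction: float = 0.8) -> int:
    """Training-set size for an assumed 80/20 split."""
    return round(fraction * n_patients)


def default_mtry(n_variables: int) -> int:
    """Assumed default: ceil(sqrt(p)) candidate variables per split."""
    return math.ceil(math.sqrt(n_variables))


def swor_sample_size(n_training: int) -> int:
    """Assumed default subsample size when sampling without replacement."""
    return round(0.632 * n_training)


for label, n_total in [("before imputation", 307), ("after imputation", 566)]:
    n_train = training_size(n_total)
    print(label, n_train, default_mtry(62), swor_sample_size(n_train))
```

Under these assumptions the values agree with the table: training sizes of 246 and 453, 8 variables tried at each split out of 62, and resample sizes of 155 and 286.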
Fig 3. Prediction error rate for the random survival forests of 5000 trees before imputation, using the log-rank (left panel) and log-rank-score (right panel) split rules on the 80% training dataset.
Fig 5. Ranking of the most predictive genes and clinical variables for colorectal cancer patients’ survival before imputation, based on how they influence the survival outcome.
Variable importance was computed using the log-rank and log-rank-score split rules in the left and right panels, respectively.
Fig 6. Ranking of the most predictive genes and clinical variables for colorectal cancer patients’ survival after imputation, based on how they influence the survival outcome.
Variable importance was computed using the log-rank and log-rank-score split rules in the left and right panels, respectively.
Fig 8. Boxplots of prediction error for RSF (log-rank and log-rank-score) and Cox PH on the 20% testing set, for both the complete-case dataset and the imputed data.
Comparison of the models using the integrated Brier scores.
| Methods | Before Imputation | After Imputation |
|---|---|---|
| Kaplan Meier | 0.199 | 0.201 |
| RSF (Log-rank) | 0.192 | 0.198 |
| RSF (Log-rank score) | 0.198 | 0.202 |
| Cox PH | 0.228 | 0.212 |
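A lower integrated Brier score (IBS) indicates better predictions, and the Kaplan-Meier row serves as a covariate-free reference. One way to read the table is to express each model's IBS as a relative change against that reference. The sketch below uses only the values reported above; the `relative_gain` helper and the choice of this particular summary are illustrative assumptions, not part of the original analysis.

```python
# IBS values copied from the table: (before imputation, after imputation).
ibs = {
    "Kaplan Meier": (0.199, 0.201),
    "RSF (Log-rank)": (0.192, 0.198),
    "RSF (Log-rank score)": (0.198, 0.202),
    "Cox PH": (0.228, 0.212),
}


def relative_gain(model_ibs: float, reference_ibs: float) -> float:
    """Fractional IBS reduction versus the reference (positive = better)."""
    return (reference_ibs - model_ibs) / reference_ibs


reference = ibs["Kaplan Meier"]
gains = {
    name: tuple(relative_gain(score, ref) for score, ref in zip(scores, reference))
    for name, scores in ibs.items()
    if name != "Kaplan Meier"
}

for name, (before, after) in gains.items():
    print(f"{name}: before {before:+.1%}, after {after:+.1%}")
```

A negative value means the model's IBS is higher than the Kaplan-Meier reference in that column.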
Summary of the filtered datasets and the pre-processing steps.
| Dataset (GSE39582) | Number of samples | Complete cases | Common samples | Total number of genes | After filtration | Uncorrelated genes | DEGs (t-test) | Common genes |
|---|---|---|---|---|---|---|---|---|
| KRAS | 585 | 545 | 307 | 54675 | 18865 | 13827 | 711 | 54 |
| BRAF | | 512 | | | | | 2388 | |
| TP53 | | 351 | | | | | 629 | |
* Three datasets with the same covariates and different clinical outcome