Kristina Thedinga, Ralf Herwig.
Abstract
Predicting cancer survival from molecular data is an important aspect of biomedical research because it allows quantifying patient risks and thus individualizing therapy. We introduce XGBoost tree ensemble learning to predict survival from transcriptome data of 8,024 patients from 25 different cancer types and show performance that is highly competitive with state-of-the-art methods. To further improve the plausibility of the machine learning approach, we conducted two additional steps. In the first step, we applied pan-cancer training and showed that it substantially improves prognosis compared with cancer subtype-specific training. In the second step, we applied network propagation and inferred a pan-cancer survival network consisting of 103 genes. This network highlights cross-cohort features and is predictive of the tumor microenvironment and immune status of the patients. Our work demonstrates that pan-cancer learning combined with network propagation generalizes over multiple cancer types and identifies biologically plausible features that can serve as biomarkers for monitoring cancer survival.
Keywords: Bioinformatics; Cancer systems biology; Mathematical biosciences
Year: 2021 PMID: 35106465 PMCID: PMC8786644 DOI: 10.1016/j.isci.2021.103617
Source DB: PubMed Journal: iScience ISSN: 2589-0042
Figure 1. Single-cohort prediction performances
(A) C-Index boxplots over 100 replications of model training for random survival forest (RF), survival SVM (SVM), Path2Surv multiple-kernel learning on the Hallmark gene sets (MKL[H]) and on the Pathway Interaction Database (MKL[P]), and the single-cohort XGBoost method (XGB[SINGLE]) on 25 different TCGA cancer cohorts. Mean C-Indices were compared with the unpaired Wilcoxon rank-sum test; significance levels are marked in the figure.
(B) Spearman correlations between predictions of the different methods for test patients from the cohorts TCGA-BLCA (left) and TCGA-UVM (right). Larger circles correspond to stronger correlations; blue indicates a positive correlation and red a negative correlation.
(C) Spearman correlation (R) between median C-Indices of single-cohort XGBoost predictions and median ages for 25 different TCGA cohorts. The blue line shows the linear regression fit to the data and the gray area indicates the 95% confidence interval.
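The C-Index reported in panels (A) and (C) can be illustrated with a minimal sketch of Harrell's concordance index for right-censored data. The function name and the toy values are hypothetical, and the sketch assumes that higher predicted risk means shorter expected survival; it is not the evaluation code used in the study.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times  : observed follow-up times
    events : 1 if the event (death) was observed, 0 if censored
    risks  : predicted risk scores (assumed: higher = shorter survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter follow-up.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the earlier
        # time corresponds to an observed event (not a censoring).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # correctly ordered pair
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four patients, the third one is censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
perfect = concordance_index(toy_times, toy_events, [0.9, 0.7, 0.3, 0.1])
mixed = concordance_index(toy_times, toy_events, [0.9, 0.1, 0.3, 0.7])
```

A value of 1 corresponds to a perfect ranking of patients, 0.5 to a random ranking, and 0 to a fully inverted ranking.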
Figure 2. Pan-cancer XGBoost training improves over single-cohort training
(A) Histogram depicting fractions of gene features shared over different numbers of training cohorts (x axis: number of TCGA cohorts a gene feature is shared over; y axis: fraction of all 46,642 genes used in at least one single-cohort model).
(B) Prediction performances of the single-cohort XGBoost method (XGB[SINGLE]) and the pan-cancer XGBoost method (XGB[PAN]) on 25 different TCGA cancer cohorts, depicted as C-Index boxplots over 100 replications of model training. Mean C-Indices were compared with the unpaired Wilcoxon rank-sum test; significance levels are marked in the figure. See also Figure S5.
(C) Venn diagram comparing features used for prediction in the single-cohort XGBoost method (pink) with those selected in the pan-cancer XGBoost method (blue). See also Figure S1.
(D) Prediction performances (C-Indices) of single-cohort XGBoost (pink) and pan-cancer XGBoost (blue) for eight new cancer cohorts (not used in model training). For the single-cohort method, the mean C-Index over all 25 models trained on different TCGA cohorts is shown.
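The quantity plotted in panel (A) can be sketched as follows. The example assumes that "shared over k cohorts" means a gene is used by exactly k single-cohort models; the cohort and gene names are hypothetical placeholders, not data from the study.

```python
from collections import Counter


def shared_feature_fractions(features_by_cohort):
    """Map k -> fraction of genes used by exactly k single-cohort models.

    features_by_cohort : dict mapping a cohort name to the set of genes
                         used as features by that cohort's model.
    The denominator is the number of genes used in at least one model.
    """
    # Count, for every gene, the number of cohorts whose model uses it.
    cohorts_per_gene = Counter(
        gene for genes in features_by_cohort.values() for gene in set(genes)
    )
    n_genes = len(cohorts_per_gene)
    # Histogram over the per-gene cohort counts.
    histogram = Counter(cohorts_per_gene.values())
    return {k: histogram[k] / n_genes for k in sorted(histogram)}


# Hypothetical toy input with three cohorts and four genes.
toy_features = {
    "COHORT_1": {"GENE_A", "GENE_B", "GENE_C"},
    "COHORT_2": {"GENE_A", "GENE_D"},
    "COHORT_3": {"GENE_A", "GENE_B"},
}
fractions = shared_feature_fractions(toy_features)
```

In this toy input, half of the genes are specific to a single cohort, and only one gene is shared by all three.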
Figure 3. Pan-cancer features are biologically plausible
(A) Weight distribution for the 100 genes with the highest feature importance (sums of feature importance scores over 100 model replications) for pan-cancer XGBoost training (gene identifiers that did not map to a Hugo symbol are named with their Ensembl identifiers). The different colors indicate gene types (blue: protein coding, orange: lncRNA, green: processed pseudogenes, purple: transcribed unprocessed pseudogene, red: gene type unknown). These gene types were obtained using the MyGene Python package (version 3.1, http://mygene.info) (Wu et al., 2013; Xin et al., 2016). See also Figure S3.
(B) Comparison of entropy distributions between the top 100 genes with the highest feature importance (measured as sums of feature importance scores over 100 model replications) from the single-cohort approach and the pan-cancer approach (mean entropies are indicated as dashed lines). The entropy measure (x axis) is based on the genes used in the single-cohort approach (cf. STAR Methods). The density of the entropy distribution is displayed on the y axis.
(C) Kaplan-Meier plots for the four most important gene features from pan-cancer XGBoost, each shown for the cancer type with the lowest FDR-corrected p value in Cox regression. The 50th percentile of gene expression was used as the cutoff. Cox regression data and Kaplan-Meier plots were retrieved from OncoLnc (Anaya, 2016). See also Figure S2.
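The Kaplan-Meier curves in panel (C) compare patients above and below the median expression of a gene. A minimal sketch of the product-limit estimator with a median split is given below. It is an illustration, not the OncoLnc implementation: the function names are hypothetical, and assigning values exactly at the cutoff to the "low" group is an assumption.

```python
from statistics import median


def kaplan_meier(times, events):
    """Product-limit estimate: list of (t, S(t)) at each distinct event time."""
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Observed events exactly at time t (censored patients excluded).
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def km_by_median_expression(expression, times, events):
    """Split patients at the 50th percentile of expression; one curve per group."""
    cutoff = median(expression)
    groups = {"high": ([], []), "low": ([], [])}
    for x, t, e in zip(expression, times, events):
        # Assumption: values equal to the cutoff go to the "low" group.
        key = "high" if x > cutoff else "low"
        groups[key][0].append(t)
        groups[key][1].append(e)
    return {key: kaplan_meier(ts, es) for key, (ts, es) in groups.items()}


# Hypothetical toy data: five patients, two of them censored.
toy_curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
toy_split = km_by_median_expression([1, 2, 3, 4], [1, 2, 3, 4], [1, 1, 1, 1])
```

Censored patients contribute to the number at risk until their censoring time but never trigger a drop in the curve.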
Figure 4. Pan-cancer survival network
(A) Largest network module identified by NetCore (Barel and Herwig, 2020), which performs network propagation and module identification, based on the important pan-cancer features. Orange nodes correspond to seed genes, while genes that were inferred during network propagation are colored in gray.
(B) Feature importance of the 103 module genes in single-cohort training (100 replications). Top: Sum of feature importance scores of the module genes per cohort. Bottom: Number of genes (of the 103 module genes) per cohort that are among the important features in single-cohort training (feature importance > 0). See also Tables S2 and S3.
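The general idea of network propagation from seed genes can be sketched with a plain random walk with restart on an undirected graph. This is a generic illustration under stated assumptions (degree-normalized transitions, uniform seed weights, a hypothetical restart probability of 0.2, and a connected toy graph); NetCore's own procedure differs in its details (see Barel and Herwig, 2020).

```python
def random_walk_with_restart(adjacency, seeds, restart=0.2, tol=1e-10, max_iter=1000):
    """Propagate seed weights over an undirected graph.

    adjacency : dict mapping each node to the set of its neighbours
                (assumed symmetric, every node with at least one neighbour)
    seeds     : set of seed nodes, weighted uniformly (an assumption)
    restart   : probability of jumping back to the seeds at each step
    """
    nodes = sorted(adjacency)
    degree = {n: len(adjacency[n]) for n in nodes}
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {}
        for n in nodes:
            # Each neighbour m passes on its weight split equally over its edges.
            inflow = sum(p[m] / degree[m] for m in adjacency[n])
            nxt[n] = (1.0 - restart) * inflow + restart * p0[n]
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical path graph A - B - C - D with a single seed gene A.
toy_graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
scores = random_walk_with_restart(toy_graph, seeds={"A"})
```

Non-seed nodes close to the seed receive more weight than distant ones, which is how genes that were not seeds can still be inferred as part of a module.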
Over-represented pathways (p < 0.001) computed with QIAGEN Ingenuity Pathway Analysis (IPA)
| Ingenuity canonical pathway | -log(p value) | Ratio | Molecules |
|---|---|---|---|
| Tumor microenvironment pathway | 9.34 | 6.25 × 10⁻² | |
| Glucocorticoid receptor signaling | 8.60 | 3.25 × 10⁻² | |
| Role of tissue factor in cancer | 8.55 | 7.76 × 10⁻² | |
| Hepatic fibrosis signaling pathway | 6.85 | 3.17 × 10⁻² | |
| Hepatic fibrosis/Hepatic stellate cell activation | 6.77 | 4.84 × 10⁻² | |
| Coagulation system | 6.31 | 1.43 × 10⁻¹ | |
| HOTAIR regulatory pathway | 6.18 | 5.00 × 10⁻² | |
| Osteoarthritis pathway | 6.15 | 4.09 × 10⁻² | |
| Growth hormone signaling | 6.08 | 8.45 × 10⁻² | |
| Inhibition of matrix metalloproteases | 6.06 | 1.28 × 10⁻¹ | |
| Glioma invasiveness signaling | 6.01 | 8.22 × 10⁻² | |
| Reelin signaling in neurons | 5.87 | 5.74 × 10⁻² | |
| Axonal guidance signaling | 5.62 | 2.43 × 10⁻² | |
| Estrogen receptor signaling | 5.62 | 3.05 × 10⁻² | |
| Leukocyte extravasation signaling | 5.57 | 4.15 × 10⁻² | |
| HIF1A signaling | 5.37 | 3.90 × 10⁻² | |
| Semaphorin signaling in neurons | 5.12 | 8.33 × 10⁻² | |
| Neuroinflammation signaling pathway | 5.05 | 3.00 × 10⁻² | |
| Molecular mechanisms of cancer | 4.87 | 2.50 × 10⁻² | |
| Tec kinase signaling | 4.86 | 4.05 × 10⁻² | |
| p38 MAPK signaling | 4.80 | 5.08 × 10⁻² | |
| Colorectal cancer metastasis signaling | 4.71 | 3.16 × 10⁻² | |
| Caveolar-mediated endocytosis signaling | 4.70 | 6.85 × 10⁻² | |
| Atherosclerosis signaling | 4.61 | 4.72 × 10⁻² | |
| ERK/MAPK signaling | 4.43 | 3.47 × 10⁻² | |
| Semaphorin neuronal repulsive signaling pathway | 4.39 | 4.32 × 10⁻² | |
| Oncostatin M signaling | 4.37 | 9.30 × 10⁻² | |
| Role of osteoblasts, osteoclasts and chondrocytes in rheumatoid arthritis | 4.22 | 3.21 × 10⁻² | |
| Sperm motility | 4.16 | 3.14 × 10⁻² | |
| Bladder cancer signaling | 4.10 | 5.15 × 10⁻² | |
| Cardiac hypertrophy signaling (enhanced) | 4.07 | 2.01 × 10⁻² | |
| Role of macrophages, fibroblasts and endothelial cells in rheumatoid arthritis | 4.05 | 2.55 × 10⁻² | |
| Chronic myeloid leukemia signaling | 3.98 | 4.85 × 10⁻² | |
| Insulin secretion signaling pathway | 3.91 | 2.87 × 10⁻² | |
| CNTF signaling | 3.89 | 7.02 × 10⁻² | |
| T cell exhaustion signaling pathway | 3.84 | 3.43 × 10⁻² | |
| Regulation of the epithelial mesenchymal transition by growth factors pathway | 3.67 | 3.19 × 10⁻² | |
| RhoGDI signaling | 3.66 | 3.17 × 10⁻² | |
| IL-15 production | 3.65 | 4.13 × 10⁻² | |
| Agranulocyte adhesion and diapedesis | 3.61 | 3.11 × 10⁻² | |
| Senescence pathway | 3.60 | 2.55 × 10⁻² | |
| Role of MAPK signaling in inhibiting the pathogenesis of influenza | 3.43 | 5.33 × 10⁻² | |
| mTOR signaling | 3.41 | 2.86 × 10⁻² | |
| Inhibition of angiogenesis by TSP1 | 3.32 | 8.82 × 10⁻² | |
| MIF-mediated glucocorticoid regulation | 3.32 | 8.82 × 10⁻² | |
| Necroptosis signaling pathway | 3.13 | 3.18 × 10⁻² | |
| Cardiac hypertrophy signaling | 3.11 | 2.50 × 10⁻² | |
| MIF regulation of innate immunity | 3.04 | 7.14 × 10⁻² | |
Pathway, annotated pathway name; -log(p value), negative logarithm of the enrichment p value computed with Fisher's exact test; Ratio, number of network module genes that map to the respective pathway divided by the overall number of genes in the pathway; Molecules, network module genes that overlap with the pathway.
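The two numeric columns of the table can be illustrated with a minimal sketch of a right-tailed Fisher's exact test, which is equivalent to a hypergeometric tail probability. The function name, the toy counts, the background size, and the use of a base-10 logarithm are assumptions for illustration; IPA's internal computation is not reproduced here.

```python
from math import comb, log10


def pathway_enrichment(n_overlap, n_module, n_pathway, n_background):
    """Right-tailed Fisher's exact test for module/pathway overlap.

    n_overlap    : module genes that map to the pathway
    n_module     : genes in the network module
    n_pathway    : genes annotated to the pathway
    n_background : genes in the reference set (assumed known)
    """
    upper = min(n_module, n_pathway)
    # P(X >= n_overlap) for X ~ Hypergeometric(n_background, n_pathway, n_module).
    favourable = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_module - k)
        for k in range(n_overlap, upper + 1)
    )
    p_value = favourable / comb(n_background, n_module)
    return {
        "ratio": n_overlap / n_pathway,      # the "Ratio" column
        "p_value": p_value,
        "neg_log10_p": -log10(p_value),      # assumed base-10 logarithm
    }


# Hypothetical toy counts: 2 of 3 module genes fall into a 4-gene pathway,
# with a background of 10 genes.
toy_result = pathway_enrichment(n_overlap=2, n_module=3, n_pathway=4, n_background=10)
```

With these toy counts the tail probability is (36 + 4) / 120 = 1/3 and the ratio is 2/4 = 0.5.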
Figure 5. Association of the pan-cancer survival network with immune subtypes. Principal component analysis (PCA) of the patients that can be assigned to an immune subtype according to Thorsson et al. (2018). The PCA is based on the 103 module genes, and patients are colored by their assigned immune subtype. The PCA plot was generated with the R library ggplot2 (Wickham, 2009).
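The projection underlying such a PCA plot can be sketched in a few lines. The example below computes only the first principal component by power iteration on the sample covariance matrix; the function name and the toy matrix are hypothetical, and the sketch is not the R workflow used for the figure. It assumes that the starting vector is not orthogonal to the leading component and that the data are not constant.

```python
import math


def first_principal_component(rows, n_iter=500):
    """Return (unit loading vector, per-sample scores) for PC1.

    rows : list of samples, each a list of feature values
           (e.g. patients x module-gene expression).
    """
    n, d = len(rows), len(rows[0])
    # Center every feature at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project the centered samples onto the loading vector.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical toy matrix: four samples lying exactly on the line y = 2x.
loadings, pc1_scores = first_principal_component([[0, 0], [1, 2], [2, 4], [3, 6]])
```

For the toy matrix, the loading vector is proportional to (1, 2), and the scores order the samples along that line.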
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| HTSeq-FPKM gene expression and clinical files | The Cancer Genome Atlas (TCGA) | |
| ConsensusPathDB (CPDB) protein-protein interaction network (version 34) | | |
| Known and candidate cancer genes from the Network of Cancer Genes (NCG) | | |
| NetCore | | |
| OncoLnc | | |
| Single-cohort and pan-cancer XGBoost survival prediction | This paper | |
| R implementations of Path2Surv (MKL[H] and MKL[P]), random survival forest (RF), and survival support vector machine (SVM) | | |
| XGBoost Python package | | |
| MyGene Python package | | |
| QIAGEN Ingenuity Pathway Analysis (QIAGEN IPA) | | |