Kristina Thedinga, Ralf Herwig.
Abstract
Cancer survival prediction is typically done with uninterpretable machine learning techniques, e.g., gradient tree boosting. Therefore, additional steps are needed to infer biological plausibility of the predictions. Here, we describe a protocol that combines pan-cancer survival prediction with XGBoost tree-ensemble learning and subsequent propagation of the learned feature weights on protein interaction networks. This protocol is based on TCGA transcriptome data of 8,024 patients from 25 cancer types but can easily be adapted to cancer patient data from other sources. For complete details on the use and execution of this protocol, please refer to Thedinga and Herwig (2022).
Keywords: Bioinformatics; Cancer; Genomics; Health Sciences; RNAseq; Systems biology
Year: 2022 PMID: 35509973 PMCID: PMC9059156 DOI: 10.1016/j.xpro.2022.101353
Source DB: PubMed Journal: STAR Protoc ISSN: 2666-1667
Figure 1. Survival prediction performances
Pan-cancer XGBoost survival prediction performance, shown as C-index boxplots, from Thedinga and Herwig (2022) for 100 replications of model training on 25 TCGA cancer cohorts.
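The C-index reported in Figure 1 measures how well predicted risks rank patients by survival time. The following is a minimal sketch, assuming Harrell's definition of the concordance index; the function name `concordance_index` and the toy data are hypothetical, and the protocol's own code may handle ties and censoring differently.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index: fraction of comparable pairs that are ranked correctly.

    times: observed survival or censoring times
    events: 1 if death was observed, 0 if the patient was censored
    risk_scores: higher score means higher predicted risk (shorter survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient a has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification: skip tied times. A pair is comparable only if
        # the shorter time corresponds to an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: four patients, the third one is censored.
c_index = concordance_index(
    times=[5, 10, 15, 20],
    events=[1, 1, 0, 1],
    risk_scores=[0.9, 0.4, 0.5, 0.1],
)
print(c_index)  # 4 of 5 comparable pairs are concordant -> 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the boxplots are usually read relative to the 0.5 baseline.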
Figure 2. Pan-cancer survival network module
Largest network module identified by NetCore (Barel and Herwig, 2020) network propagation and module identification, based on the pan-cancer important features identified by Thedinga and Herwig (2022) from 100 replications of XGBoost model training. Orange nodes correspond to seed genes; genes inferred during network propagation are colored gray. Figure reprinted with permission from Thedinga and Herwig (2022).
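The idea of propagating seed-gene weights over an interaction network can be illustrated with a generic random walk with restart. This is only a sketch under stated assumptions: it uses plain degree normalization on a hypothetical four-node graph, and it is not NetCore's implementation, which uses its own normalization and module-identification steps (see Barel and Herwig, 2020). The function name `propagate` and the `restart` value are illustrative choices.

```python
def propagate(adjacency, seed_weights, restart=0.8, tol=1e-10, max_iter=1000):
    """Random walk with restart on an undirected graph.

    adjacency: dict mapping node -> set of neighbor nodes (symmetric)
    seed_weights: dict mapping seed node -> initial weight
    restart: probability of jumping back to the seeds at each step
    """
    nodes = sorted(adjacency)
    total = sum(seed_weights.values())
    # Normalize the seed weights into a probability distribution.
    p0 = {n: seed_weights.get(n, 0.0) / total for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {}
        for n in nodes:
            # Each neighbor passes on an equal share of its current weight.
            inflow = sum(p[m] / len(adjacency[m]) for m in adjacency[n])
            nxt[n] = (1.0 - restart) * inflow + restart * p0[n]
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:  # stop once the scores no longer change
            break
    return p


# Hypothetical path graph A - B - C - D with a single seed node A.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
scores = propagate(graph, {"A": 1.0})
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

Non-seed nodes close to the seed receive higher scores than distant ones, which is the intuition behind the gray (inferred) nodes surrounding the orange seed genes in the module.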
Figure 3. Over-representation analysis with ConsensusPathDB
Red numbers (1–8) illustrate the steps necessary to perform an over-representation analysis of the module genes identified during network propagation using the ConsensusPathDB (Herwig et al., 2016) ORA implementation.
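For readers who want to see what an over-representation analysis computes, the sketch below evaluates a one-sided hypergeometric p-value for a single gene set. It assumes the usual hypergeometric formulation of ORA and is not the ConsensusPathDB implementation; the function name `ora_p_value` and all counts are hypothetical. Note that `math.comb` requires Python 3.8 or later.

```python
from math import comb


def ora_p_value(n_background, n_pathway, n_module, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_background: number of genes in the background set
    n_pathway: number of background genes annotated to the pathway
    n_module: number of module genes being tested
    n_overlap: number of module genes that belong to the pathway
    """
    upper = min(n_pathway, n_module)
    total = comb(n_background, n_module)
    # Sum the probabilities of observing n_overlap or more pathway genes.
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_module - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20 background genes, 5 in the pathway,
# 4 module genes of which 2 fall in the pathway.
p = ora_p_value(n_background=20, n_pathway=5, n_module=4, n_overlap=2)
print(round(p, 4))  # 1205 / 4845 -> 0.2487
```

In practice the test is repeated for many gene sets, so the resulting p-values need a multiple-testing correction before interpretation.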
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| HTSeq-FPKM gene expression and clinical files | The Cancer Genome Atlas (TCGA); available through GDC data portal or TCGAbiolinks Bioconductor/R package ( | GDC data portal: |
| ConsensusPathDB (CPDB) protein-protein interaction network (version 34) | ( | |
| Python 3.7 | Python Software Foundation | |
| NumPy (version 1.18.5) Python package | ( | |
| Pandas (version 1.1.5) Python package | ( | |
| tqdm (version 4.38.0) Python package | ( | |
| SciPy (version 1.2.1) Python package | ( | |
| Matplotlib (version 3.1.1) Python package | ( | |
| scikit-learn (version 0.22.2.post1) Python package | ( | |
| Seaborn (version 0.9.0) Python package | ( | |
| NetworkX (version 2.3) Python package | ( | |
| XGBoost (version 0.90) Python package | ( | |
| MyGene (version 3.1.0) Python package | ( | |
| R 3.6.3 | The R Foundation | |
| Bioconductor (version 3.10) R package | ( | |
| TCGAbiolinks (version 2.12.6) Bioconductor/R package | ( | |
| optparse (version 1.6.6) R package | The Comprehensive R Archive Network (CRAN) | |
| dplyr (version 1.0.0) R package | ( | |
| reshape2 (version 1.4.4) R package | ( | |
| rjson (version 0.2.20) R package | ( | |
| ggplot2 (version 3.3.1) R package | ( | |
| ggpubr (version 0.2.5) R package | ( | |
| Git version control system | Software Freedom Conservancy | |
| Pan-cancer XGBoost survival prediction | ( | |
| Code for processing intermediate results | This Paper; ( | |
| NetCore | ( | |
| CPDB over-representation analysis | ( | |