Luís A Vale-Silva, Karl Rohr.
Abstract
The age of precision medicine demands powerful computational techniques to handle high-dimensional patient data. We present MultiSurv, a multimodal deep learning method for long-term pan-cancer survival prediction. MultiSurv uses dedicated submodels to establish feature representations of clinical, imaging, and different high-dimensional omics data modalities. A data fusion layer aggregates the multimodal representations, and a prediction submodel generates conditional survival probabilities for follow-up time intervals spanning several decades. MultiSurv is the first non-linear and non-proportional survival prediction method that leverages multimodal data. In addition, MultiSurv can handle missing data, including single values and complete data modalities. MultiSurv was applied to data from 33 different cancer types and yields accurate pan-cancer patient survival curves. A quantitative comparison with previous methods showed that MultiSurv achieves the best results according to different time-dependent metrics. We also generated visualizations of the learned multimodal representation of MultiSurv, which revealed insights into cancer characteristics and heterogeneity.
Year: 2021 PMID: 34188098 PMCID: PMC8242026 DOI: 10.1038/s41598-021-92799-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. The MultiSurv model architecture. Input data are all from the NCI Genomic Data Commons database, including up to six different data modalities. Each data modality is handled by a dedicated DL submodel, trained to generate modality-specific feature representations. A data fusion layer combines the generated feature representation vectors into a single fused representation. A final neural network takes the fused feature representation and outputs a conditional survival probability for each of a set of pre-defined follow-up time intervals. Taking the cumulative product of the set of conditional survival probabilities produces the predicted survival curve. CNN convolutional neural network, FC fully-connected neural network.
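The last step of the caption, turning per-interval conditional survival probabilities into a survival curve, can be illustrated with a minimal sketch. The function name `survival_curve` and the example probabilities are hypothetical and are not taken from the MultiSurv code or results; only the cumulative-product rule comes from the caption.

```python
from itertools import accumulate
from operator import mul


def survival_curve(conditional_probs):
    """Return S(t_k) = p_1 * p_2 * ... * p_k for each follow-up interval k.

    conditional_probs[k] is the probability of surviving interval k,
    given survival up to the start of that interval.
    """
    for p in conditional_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p!r}")
    # Running product over the intervals yields a non-increasing curve.
    return list(accumulate(conditional_probs, mul))


# Illustrative values only (assumed, not model output).
example_probs = [0.75, 0.5, 0.5]
example_curve = survival_curve(example_probs)
```

Because every factor lies in [0, 1], the resulting curve can never increase from one interval to the next, which is the property expected of a survival function.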
Method performance with unimodal data inputs.
| Metric | Data | CPH | RSF | DeepSurv | DeepHit | MultiSurv |
|---|---|---|---|---|---|---|
| Ctd | Clinical | 0.770 (0.751–0.789) | 0.792 (0.773–0.810) | |||
| mRNA | 0.733 (0.712–0.755) | 0.719 (0.695–0.741) | 0.746 (0.722–0.768) | |||
| DNAm | 0.729 (0.709–0.752) | 0.737 (0.715–0.758) | 0.736 (0.714–0.759) | |||
| miRNA | 0.676 (0.651–0.700) | 0.664 (0.639–0.689) | 0.685 (0.661–0.711) | |||
| CNV | 0.570 (0.543–0.599) | 0.596 (0.571–0.621) | 0.575 (0.549–0.599) | |||
| WSI | – | – | – | – | ||
| IBS | Clinical | 0.184 (0.179–0.191) | ||||
| mRNA | 0.191 (0.181–0.200) | 0.180 (0.159–0.198) | 0.191 (0.180–0.198) | |||
| DNAm | 0.179 (0.165–0.192) | 0.186 (0.176–0.192) | 0.194 (0.179–0.208) | |||
| miRNA | 0.193 (0.183–0.201) | 0.194 (0.170–0.219) | ||||
| CNV | 0.217 (0.208–0.225) | 0.229 (0.215–0.247) | 0.217 (0.212–0.224) | |||
| WSI | – | – | – | – | ||
CPH Cox proportional hazards, RSF random survival forest, Clinical tabular clinical data, mRNA gene expression, DNAm DNA methylation, miRNA microRNA expression, CNV gene copy number variation, WSI whole-slide images.
Time-dependent concordance index (Ctd) and integrated Brier score (IBS) with 95% bootstrap confidence interval (CI; numbers in parentheses).
The best and second-best results for each metric and data modality are shown in boldface and italics, respectively.
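The intervals in parentheses are bootstrap confidence intervals. As a rough sketch of the general idea, the snippet below computes a percentile bootstrap interval for an arbitrary metric. The percentile variant, the number of resamples, the function name `bootstrap_ci`, and the toy per-patient scores are all assumptions made for illustration; the source does not state which bootstrap procedure was used.

```python
import random
from statistics import mean


def bootstrap_ci(samples, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed variant) for metric(samples)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(samples)
    stats = []
    for _ in range(n_boot):
        # Resample patients with replacement, same size as the original set.
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        stats.append(metric(resample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical per-patient error scores, used only to exercise the function.
toy_scores = [0.10, 0.25, 0.18, 0.30, 0.22, 0.15, 0.27, 0.20]
ci_low, ci_high = bootstrap_ci(toy_scores, mean)
```

For metrics such as Ctd or IBS, `metric` would take the resampled patients' predictions and outcomes rather than precomputed scores; the resampling logic stays the same.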
Model performance using a selection of combinations of the six input data modalities.
| Clinical | mRNA | DNAm | miRNA | CNV | WSI | Ctd (95% CI) | IBS (95% CI) |
|---|---|---|---|---|---|---|---|
| | | | | | | 0.808 (0.791–0.826) | |
| | | | | | | 0.792 (0.775–0.810) | 0.147 (0.136–0.161) |
| | | | | | | 0.795 (0.778–0.812) | 0.140 (0.131–0.152) |
| | | | | | | 0.801 (0.783–0.817) | 0.148 (0.140–0.158) |
| | | | | | | | 0.146 (0.135–0.158) |
| | | | | | | 0.798 (0.781–0.815) | 0.153 (0.139–0.168) |
| | | | | | | 0.802 (0.748–0.820) | 0.149 (0.136–0.162) |
| | | | | | | 0.787 (0.769–0.806) | 0.152 (0.140–0.166) |
Clinical tabular clinical data, mRNA gene expression, DNAm DNA methylation, miRNA microRNA expression, CNV gene copy number variation, WSI whole-slide images.
Individual data modalities included in each evaluated model are marked in the corresponding columns. The best and second-best results for each metric are shown in boldface and italics, respectively.
Figure 2. MultiSurv predictions allow construction of accurate survival curves. MultiSurv outputs patient survival predictions for the defined discrete-time follow-up intervals. These can then be averaged to obtain group-wide survival predictions. (a) Survival curves constructed using MultiSurv predictions for each patient in the test dataset diagnosed with one of four selected cancer types. One example patient is highlighted for each cancer type and the corresponding last follow-up time point is annotated (as “Last follow up” if the patient is censored or “Death” if it corresponds to patient death). Highlighted patient codes are TCGA-HI-7169 for PRAD, TCGA-B0-5691 for KIRC, TCGA-29-1762 for OV, and TCGA-19-1390 for GBM. (b) Survival curves for the four example cancer types in (a) compared with Kaplan–Meier estimator outputs. (c) Survival curves for all patients in the test dataset compared with the Kaplan–Meier estimator output. (d) MultiSurv predictions allow accurate stratification of patient risk groups. Patients were split into low- and high-risk groups according to MultiSurv’s first output interval risk prediction, using the median value across all patients as the threshold. The two resulting groups have significantly different Kaplan–Meier estimates (log-rank test). The plot shows MultiSurv prediction averages overlaid on the Kaplan–Meier estimators. PRAD prostate adenocarcinoma, KIRC kidney renal clear cell carcinoma, OV ovarian serous cystadenocarcinoma, GBM glioblastoma multiforme.
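The group averaging and the median split described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: risk for the first interval is taken to be one minus the predicted survival probability for that interval, patients exactly at the median are placed in the low-risk group, and the function names and the four toy curves are hypothetical.

```python
from statistics import median


def mean_curve(curves):
    """Average per-patient survival curves interval by interval."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]


def split_by_first_interval_risk(curves):
    """Split patients into low- and high-risk groups at the median risk.

    Assumption: first-interval risk = 1 - first-interval survival probability.
    Assumption: ties at the median go to the low-risk group.
    """
    risks = [1.0 - curve[0] for curve in curves]
    threshold = median(risks)
    low = [c for c, r in zip(curves, risks) if r <= threshold]
    high = [c for c, r in zip(curves, risks) if r > threshold]
    return low, high


# Toy predicted survival curves for four patients (illustrative values only).
toy_curves = [
    [0.90, 0.80],
    [0.50, 0.25],
    [0.75, 0.50],
    [0.25, 0.125],
]
low_group, high_group = split_by_first_interval_risk(toy_curves)
low_mean = mean_curve(low_group)
high_mean = mean_curve(high_group)
```

The two averaged curves are what would be overlaid on the Kaplan–Meier estimates of the corresponding groups in a plot such as panel (d).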
Figure 3. Visualization of feature representations learned by MultiSurv. We collected the internal fused feature representation vector of the MultiSurv model trained on clinical and mRNA data and embedded it into a two-dimensional space using t-SNE. (a) Embedded feature representations for each patient in the test dataset. Patients diagnosed with each of four selected cancer types are highlighted. Within each of the highlighted cancer types, visually selected outlier patients are annotated. All patient survival curves, highlighting the selected outliers, are displayed for (b) PRAD, (c) KIRC, (d) GBM, and (e) OV. PRAD prostate adenocarcinoma, KIRC kidney renal clear cell carcinoma, OV ovarian serous cystadenocarcinoma, GBM glioblastoma multiforme.
Summary information of the different data modalities after preprocessing.
| Modality | No. patients | No. continuous features | No. categorical features |
|---|---|---|---|
| Clinical | 11,081 | 1 | 9 |
| mRNA | 9605 | 1000 | – |
| DNAm | 10,257 | 5000 | – |
| miRNA | 9616 | 1881 | – |
| CNV | 10,325 | – | 2000 |
| WSI | 8376 | 299 | – |