Qian Liu1,2,3, Pingzhao Hu1,2. 1. Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, Manitoba R3E 0W3, Canada. 2. Department of Computer Science, University of Manitoba, Winnipeg, Manitoba R3E 0W3, Canada. 3. Department of Statistics, University of Manitoba, Winnipeg, Manitoba R3E 0W3, Canada.
Abstract
In precise medicine, it is with great value to develop computational frameworks for identifying prognostic biomarkers which can capture both multi-genomic and phenotypic heterogeneity of breast cancer (BC). Radiogenomics is a field where medical images and genomic measurements are integrated and mined to solve challenging clinical problems. Previous radiogenomic studies suffered from data incompleteness, feature subjectivity and low interpretability. For example, the majority of the radiogenomic studies miss one or two of medical imaging data, genomic data, and clinical outcome data, which results in the data incomplete issue. Feature subjectivity issue comes from the extraction of imaging features with significant human involvement. Thus, there is an urgent need to address above-mentioned limitations so that fully automatic and transparent radiogenomic prognostic biomarkers could be identified for BC. We proposed a novel framework for BC prognostic radiogenomic biomarker identification. This framework involves an explainable DL model for image feature extraction, a Bayesian tensor factorization (BTF) processing for multi-genomic feature extraction, a leverage strategy to utilize unpaired imaging, genomic, and survival outcome data, and a mediation analysis to provide further interpretation for identified biomarkers. This work provided a new perspective for conducting a comprehensive radiogenomic study when only limited resources are given. Compared with baseline traditional radiogenomic biomarkers, the 23 biomarkers identified by the proposed framework performed better in indicating patients' survival outcome. And their interpretability is guaranteed by different levels of build-in and follow-up analyses.
In precise medicine, it is with great value to develop computational frameworks for identifying prognostic biomarkers which can capture both multi-genomic and phenotypic heterogeneity of breast cancer (BC). Radiogenomics is a field where medical images and genomic measurements are integrated and mined to solve challenging clinical problems. Previous radiogenomic studies suffered from data incompleteness, feature subjectivity and low interpretability. For example, the majority of the radiogenomic studies miss one or two of medical imaging data, genomic data, and clinical outcome data, which results in the data incomplete issue. Feature subjectivity issue comes from the extraction of imaging features with significant human involvement. Thus, there is an urgent need to address above-mentioned limitations so that fully automatic and transparent radiogenomic prognostic biomarkers could be identified for BC. We proposed a novel framework for BC prognostic radiogenomic biomarker identification. This framework involves an explainable DL model for image feature extraction, a Bayesian tensor factorization (BTF) processing for multi-genomic feature extraction, a leverage strategy to utilize unpaired imaging, genomic, and survival outcome data, and a mediation analysis to provide further interpretation for identified biomarkers. This work provided a new perspective for conducting a comprehensive radiogenomic study when only limited resources are given. Compared with baseline traditional radiogenomic biomarkers, the 23 biomarkers identified by the proposed framework performed better in indicating patients' survival outcome. And their interpretability is guaranteed by different levels of build-in and follow-up analyses.
Breast cancer (BC) is the most commonly diagnosed cancer and is one of the leading causes of cancer death for women worldwide [1]. It is an advanced solid tumor with very high heterogeneity that comes from a variety of cellular function gain and loss during the development of the tumor. The widely accepted cancer theory indicates that there might be ten large biological capabilities acquired during the course of human tumors [2]. These unnormal biological capabilities range from sustaining proliferative signaling to evading immune destruction. They influence intermediate phenotypes such as tumor morphology and then eventually change the clinical outcomes such as overall survival (OS). Therefore, it is critical to characterize the cellular heterogeneity comprehensively. Meanwhile, the intermediate tumor morphology is also worth of consideration in estimating the clinical outcome of patients. The cellular heterogeneity could be detected in different biological levels using variety of modern molecular biological techniques which could generate high-throughput measurements, such as gene expression values, copy number variation (CNV) scores, and DNA methylation levels. These measurements contain rich and valuable information about the molecular heterogeneity but are hard to be digested directly by human. A lot of computational tools have been developed to help human experts summary the heterogeneity of cancer cells from these high-dimensional molecular biology measurements [3]. The majority of them involve unsupervised matrix deconvolution in their workflow [4], [5]. One disadvantage of matrix deconvolution is that it cannot keep the inherent and complement information of different biological levels because matrix deconvolution method simply merges different molecular omics data matrix into a big data matrix without consider the interaction between them [6]. Furthermore, it might be difficult to establish biological interpretations for the variety of genomic factors calculated using the matrix deconvolutional operation [7].Recently, tensor decomposition has been introduced into multi-genomic data analysis [8]. Tensor is defined as a high-dimensional data array [9]. Several two-dimensional genomics data matrices can form a three-dimensional tensor with the new dimension representing the data sources (such as genotyping data, gene expressions, DNA methylation, et al). Then the tensor can be decomposed or factorized into factors with a reduced size [9], [10], [11]. These factors could represent the heterogeneity of the tumor and inform prognosis. Comparing with matrix deconvolution, tensor decomposition could take the cross-level interactions into consideration. What is more, the decomposed latent factors have patient-directional projections as well as gene-directional projections. Patient-directional factors could reflect the heterogeneity among subjects, while gene-directional projections could capture the contribution of each gene to each patient-directional factor. By analyzing the gene-directional projection matrix, key biological functions of each patient-directional factor could be estimated. Tensor decomposition is not a new topic and there are several algorithms to conduct this task. Among them, canonical decomposition (CANDECOM) [11] and parallel factors (PARAFAC) [10] are often referred together as CP (CANDECOM/ PARAFAC) because they both decompose a tensor as the sum of several rank-one tensors [12], [13], [14]. The number of these rank-one tensors, which is also called the rank of the given tensor, needs to be pre-defined. However, determining the rank of a given tensor is a non-deterministic polynomial time (NP) problem [15], [16], and for a long time, there had been no direct way to solve this problem [9]. Until recently, the emerge of Bayesian tensor factorization (BTF) algorithm provided a solution [17]. BTF first uses a multi-linear model to decompose the given tensor to latent factors, then performs Bayesian inference to estimate the posterior distribution of these latent factors. At the end is a filtering procedure to remove redundant factors. In this way, BTF could determine the optimum rank of a tensor and extract latent factors at the same time. Factors extracted by BTF performed better in many healthcare-related tasks comparing with the other tensor factorization algorithms [17].Although multi-genomic measurements could provide us with rich information about the cellular tumor heterogeneity, the genomic examination is invasive and sometime expensive. In addition, it may not be able to capture the dynamic and macroscopic information of the whole tumor as the biopsy is often taking at a certain time point on a small bulk of tumor tissues. Radiomics is a research field where high-throughput medical image features are used to describe disease phenotypes [18]. It could be used as an auxiliary or surrogate of the multi-genomic analysis. Medical imaging is non-invasive, so it is often used as a disease monitoring method and thus performed at multiple time points during the course of the BC. And the imaging region of interest (ROI) usually covers the entire tumor and even the tissues around the tumor. Current radiomic studies are facing feature subjectivity and interpretability trade-off issue. Traditional computational engineering methods usually involve human experts’ pre-processing which introduces the subjectivity. Although human understandable image features such as tumor morphological features and the first-order, second-order statistic features of the image pixel distribution [19] could be generated using the traditional feature engineering methods, they are pre-defined and limited by human knowledge therefore may not be able to fully represent the image heterogeneity. Recently, with the fast development of deep learning (DL) techniques, DL-based feature extraction approaches have been widely used in radiomics [20]. DL is highly flexible and accurate in analyzing multi-modal volumetric and dynamic medical images in a fully automatic and non-linear manner [21]. But image features extracted by sophisticated DL models are considered not human understandable. Therefore, it is critical to explore potential tools to increase their explainability [22].Combining radiomics with genomics leads to the field of radiogenomics, which has a goal of noninvasively uncover the radiogenomic biomarkers that could indicate the clinical outcomes of the patients [3]. Besides the challenges in radiomics and genomics, radiogenomics also faces the unpaired data problem. Currently, the publicly available BC datasets are usually incomplete to do a biomarker-oriented radiogenomics study. For example, a dataset may contain medical images and genomics data for the same patients, which provides us with enough information for feature extraction and radiogenomic mapping, but the patients’ clinical outcomes might be hard to obtain as this may need long-term observation. Hence the prognostic significance of the image features could not be evaluated (i.e. the image features cannot be identified as prognostic biomarkers) [23], [24]. Effective utilization of the unpaired imaging, genomic, and clinical data should be considered wisely.In this study, we propose a DL-based radiogenomic framework for prognostic biomarker identification. Our framework includes the following five modules: a DL-based multi-modal image feature extraction module with build-in saliency maps for DL explanation; a BTF multi-genomic feature extraction module with gene set enrichment analysis (GSEA) [25] to explore the biological meaning of the extracted features; a radiogenomic leverage module consists of a series of predictive models to impute the unpaired imaging, genomic, and survival data; a prognostic biomarker identification module which uses survival analysis to evaluate the prognostic significance of each radiogenomic feature; and a statistic mediation analysis module to provide potential biological causal inference of the identified prognostic biomarkers. It is expected that the identified radiogenomic biomarkers have better prognostic significance than the traditional ones.
Material and methods
The overall design of this study is shown in Fig. 1. Single-radiogenomic stage (Fig. 1A) is a baseline workflow with only gene expression as genomic data source. This is to test whether multi-genomic features have better radiogenomic associations than the single-genomic features. Multi-radiogenomic stage (Fig. 1B) is the proposed workflow.
Fig. 1
The overall workflow of this study. A DL model (3DU-net) was built, trained, and validated to segment the tumor region from the raw three-dimensional DCE-MRI image. After the 3DU-net were well-trained, DL-based radiomic (DLR) features were extracted from the last hidden layer in the encoding phase of the model. Gradient-based saliency maps were generated to show the importance of each input pixel to the 3DU-net of making its segmentation decision. A: Single-radiogenomic stage. In this stage, we first focus on the paired data (top panel of A). Three-level gene expression features (197 breast cancer risk gene expressions, 182 KEGG pathway activities, and 6 well-established breast cancer gene signatures) are generated. Then, lasso models are built to predict each DLR feature and semi-auto radiomic (SAR) feature using these three-level gene expression features. After the predictive lasso models are well-trained and validated, we turn to the unpaired data (bottom panel of A). We generate the same three-level gene expression features using the unpaired data, then we apply the well-trained lasso models to get the predicted DLR and SAR features. In this way, we could generate the DLR and SAR features for the 1002 patients without medical images. Then, we performed survival analysis on the predicted DLR and SAR features. The significant ones are the identified prognostic radiogenomic biomarkers. Mediation analysis is then performed on these identified radiogenomic biomarkers to check the potential biological mechanisms of them. B: Multi-radiogenomic stage. In this stage, similar procedures of the single-radiogenomic stage are performed. We first focus on the paired data (top panel of B). We perform Bayesian tensor factorization (BTF) on the multi-genomic data tensor to extract 17 BTF features. We also run gene set enrichment analysis (GSEA) to identify the key biological pathway of each BTF feature. These key pathways could explain the key functions of the identified multi-genomic BTF features. Then we train lasso models to utilize these 17 BTF features for predicting the DLR and SAR features. After the lasso models are well-trained and well-validated, we turn to the unpaired data (bottom panel of B). We obtain the BTF features using the multi-genomic data, then we apply the well-trained lasso models in the previous step to get the predicted DLR and SAR features. In this way, we could get the DLR and SAR features for the 701 patients without medical images. Then, we perform survival analysis on the predicted DLR and SAR features. The significant ones are the identified radiogenomic biomarkers. Mediation analysis is then performed on each of these identified radiogenomic biomarkers to check the potential biological mechanisms of them.
The overall workflow of this study. A DL model (3DU-net) was built, trained, and validated to segment the tumor region from the raw three-dimensional DCE-MRI image. After the 3DU-net were well-trained, DL-based radiomic (DLR) features were extracted from the last hidden layer in the encoding phase of the model. Gradient-based saliency maps were generated to show the importance of each input pixel to the 3DU-net of making its segmentation decision. A: Single-radiogenomic stage. In this stage, we first focus on the paired data (top panel of A). Three-level gene expression features (197 breast cancer risk gene expressions, 182 KEGG pathway activities, and 6 well-established breast cancer gene signatures) are generated. Then, lasso models are built to predict each DLR feature and semi-auto radiomic (SAR) feature using these three-level gene expression features. After the predictive lasso models are well-trained and validated, we turn to the unpaired data (bottom panel of A). We generate the same three-level gene expression features using the unpaired data, then we apply the well-trained lasso models to get the predicted DLR and SAR features. In this way, we could generate the DLR and SAR features for the 1002 patients without medical images. Then, we performed survival analysis on the predicted DLR and SAR features. The significant ones are the identified prognostic radiogenomic biomarkers. Mediation analysis is then performed on these identified radiogenomic biomarkers to check the potential biological mechanisms of them. B: Multi-radiogenomic stage. In this stage, similar procedures of the single-radiogenomic stage are performed. We first focus on the paired data (top panel of B). We perform Bayesian tensor factorization (BTF) on the multi-genomic data tensor to extract 17 BTF features. We also run gene set enrichment analysis (GSEA) to identify the key biological pathway of each BTF feature. These key pathways could explain the key functions of the identified multi-genomic BTF features. Then we train lasso models to utilize these 17 BTF features for predicting the DLR and SAR features. After the lasso models are well-trained and well-validated, we turn to the unpaired data (bottom panel of B). We obtain the BTF features using the multi-genomic data, then we apply the well-trained lasso models in the previous step to get the predicted DLR and SAR features. In this way, we could get the DLR and SAR features for the 701 patients without medical images. Then, we perform survival analysis on the predicted DLR and SAR features. The significant ones are the identified radiogenomic biomarkers. Mediation analysis is then performed on each of these identified radiogenomic biomarkers to check the potential biological mechanisms of them.
Formation of the datasets
Multi-source genomic data (gene expression, CNV, and DNA methylation) of the breast carcinoma cohort (BRCA) are provided by The Cancer Genome Atlas (TCGA) [24] platform. Medical image data, specifically, the three-dimensional dynamic contrast-enhanced Magnetic resonance imaging (DCE-MRI) volumes of a sub-cohort of the BRCA, are collected from The Cancer Imaging Archive (TCIA) [23] platform. Part of these medical images has segmentation labels and 36 traditional semi-auto radiomic (SAR) features provided by the TCIA Breast Phenotype Research Group [26]. The data of BRCA cohort from both TCGA and TCIA form two datasets for this study: BRCA single-radiogenomic dataset, and BRCA multi-radiogenomic dataset. The exact data matching and formation workflow can be found in Fig. 2. The demographic information of the sub-cohorts is listed in Table 1.
Fig. 2
Breast cancer cohort (BRCA) single- and multi- radiogenomic datasets organization flowcharts. TCGA provides genomic data and clinical data of a cohort with 1097 BRCA patients. TCIA provides medical images of partial TCGA-BRCA cohort (137 patients). Four of the 1097 BRCA patients have no meaningful survival information (death days or last contact days are negative numbers) thus were excluded. Nighty-one of 137 TCIA patients have annotated dynamic contrast-enhanced magnetic resonance images (DCE-MRI). A: BRCA single-radiogenomic dataset. One-thousand and nighty-three (1097-4) TCGA-BRCA patients all have gene expression data. Thus, gene expression data were used as baseline single-genomic information in this study to compare with the multi-genomic information. Those 91 patients with annotated DCE-MRI are all included in the 1093 TCGA-BRCA patients. This means, 91 patients have paired gene expression data and annotated image data. However, no survival difference is observed among these 91 patients, because they were all alive according to the last follow-up. Therefore, we cannot perform survival analysis using the paired data. The rest of 1002 (1093-91) patients only have gene expression data (no DCE-MRI data), but there exists survival difference among them. B: BRCA multi-radiogenomic dataset. Only 762 of the 1093 TCGA-BRCA patients have matched gene expression, copy number alteration (CNA), and DNA methylation data. Sixty-one of those 91 TCIA-BRCA patients with annotated DCE-MRI are included in the 762 TCGA-BRCA patients with multi-genomic data. This means, 61 patients have paired multi-genomic data and annotated image data. However, no survival difference is observed among these 61 patients. The rest of 701 (762-61) patients only have multi-genomic data (no medical image data), but there exists survival difference among them.
Table 1
Demographics of BRCA sub-cohorts.
Single-radiogenomic dataset
Multi-radiogenomic dataset
Unpaired data
Paired data
Unpaired data
Paired data
Number of patients
1002
91
701
61
Age at diagnosis
≥65
329
14
215
8
<65
673
77
486
53
Mean
58.9
53.6
58.3
54.0
Min
26
29
26
29
Max
90
82
90
82
Standard deviation
13.3
11.5
13.2
11.5
Stage
I
160
22
107
14
II
561
58
386
39
III
237
11
187
8
X or IV
33
0
15
0
Other
11
0
6
0
ER Status
Positive
729
77
499
53
Negative
224
14
159
8
Not Evaluated
49
0
43
0
PR Status
Positive
625
72
434
48
Negative
325
19
221
13
Indeterminate
4
0
2
0
Not Evaluated
48
0
44
0
HER2 Status
Positive
150
14
81
7
Negative
512
49
353
34
Indeterminate
12
0
12
0
Equivocal
157
22
118
18
Not Evaluated/Available
171
6
137
2
Pam50 subtype
LumA
498
63
366
43
LumB
197
11
132
9
Her2
78
4
44
1
Basal
178
12
126
7
Normal
39
1
33
1
NA
12
0
0
0
Breast cancer cohort (BRCA) single- and multi- radiogenomic datasets organization flowcharts. TCGA provides genomic data and clinical data of a cohort with 1097 BRCA patients. TCIA provides medical images of partial TCGA-BRCA cohort (137 patients). Four of the 1097 BRCA patients have no meaningful survival information (death days or last contact days are negative numbers) thus were excluded. Nighty-one of 137 TCIA patients have annotated dynamic contrast-enhanced magnetic resonance images (DCE-MRI). A: BRCA single-radiogenomic dataset. One-thousand and nighty-three (1097-4) TCGA-BRCA patients all have gene expression data. Thus, gene expression data were used as baseline single-genomic information in this study to compare with the multi-genomic information. Those 91 patients with annotated DCE-MRI are all included in the 1093 TCGA-BRCA patients. This means, 91 patients have paired gene expression data and annotated image data. However, no survival difference is observed among these 91 patients, because they were all alive according to the last follow-up. Therefore, we cannot perform survival analysis using the paired data. The rest of 1002 (1093-91) patients only have gene expression data (no DCE-MRI data), but there exists survival difference among them. B: BRCA multi-radiogenomic dataset. Only 762 of the 1093 TCGA-BRCA patients have matched gene expression, copy number alteration (CNA), and DNA methylation data. Sixty-one of those 91 TCIA-BRCA patients with annotated DCE-MRI are included in the 762 TCGA-BRCA patients with multi-genomic data. This means, 61 patients have paired multi-genomic data and annotated image data. However, no survival difference is observed among these 61 patients. The rest of 701 (762-61) patients only have multi-genomic data (no medical image data), but there exists survival difference among them.Demographics of BRCA sub-cohorts.
Explainable DL-based image feature extraction
DCE-MRI volumes of the same patients were acquired at different time points with an interval of dozens of seconds [27]. That is where the “dynamic” comes from and it is a very strong advantage of DCE-MRI. Besides, the number of DCE-MRI volumes (i.e., time points) variates among patients, depending on the exam pipeline of the imaging institute and the patient’s individual conditions (such as blood flow velocities). To handle this problem, a multi-modal three-dimensional DL model [28], [29], [30], [31] called 3DU-Net [32] was applied to incorporates DCE-MRI volumes acquired at different time points to extract fused and dynamic deep DL-based radiomic (DLR) features. The modality of the input was set to the maximum number of volumes a given patient can have, which is 8. If the given patient has less than 8 DCE-MRI volumes, the position with absent volume was set to empty. The output of the 3DU-Net is the tumor segments provided by TCIA Breast Phenotype Research Group [26]. Two gradient-based saliency maps (Gradient map [33] and Gradient*image map [34]) were embedded in the structure of the 3DU-Net to support explaining the segmentation decision. The detailed model structure of 3DU-Net is shown in Fig. 3.
Fig. 3
The structure of explainable 3DU-Net. The modality of the input is set to 8, which is the maximum number of volumes a patient can have. If a patient has fewer than 8 DCE-MRI volumes, the positions with absent volumes are set to empty. The output is the tumor segment annotation. The last hidden layer of the encoder phase is the DLR features. Two explanation tools (Gradient map and Gradient*image map) are used to increase the explainability of the 3DU-net.
The structure of explainable 3DU-Net. The modality of the input is set to 8, which is the maximum number of volumes a patient can have. If a patient has fewer than 8 DCE-MRI volumes, the positions with absent volumes are set to empty. The output is the tumor segment annotation. The last hidden layer of the encoder phase is the DLR features. Two explanation tools (Gradient map and Gradient*image map) are used to increase the explainability of the 3DU-net.The number of DLR features was set as 32, which is comparable with the 36 SAR features provided with the data. The 91 patients with annotated DCE-MRI data were involved in training and validating the 3DU-net. Patients were randomly split into train set (71 patients), validation set (10 patients), and test set (10 patients). We did hyperparameter tunning for the 3DU-net on stride, learning rate, and dropout ratio. The best hyperparameter combination was then used in the final model. After 1000 epochs, the performance of the segmentation task measured by Dice similarity coefficient (DSC). Given a reference segmentation Slab, the DSC of a predicted segmentation Spred is defined as.Then the well-trained 3DU-net was applied to the whole 91 patients for DLR feature extraction. We then performed pair-wise correlation analysis among all DLR features to see if they are correlated with each other.
BTF for multi-genomic feature extraction
Using the R package “tensorBF” [35], we implemented the BTF algorithm to extract latent factors (patient-directional projection matrix) from the gene expression, CNV, and DNA methylation data for the 762 patients from TCGA-BRCA. More details of the BTF algorithm can be found in Supplementary Fig. 1. We did GSEA using the gene-directional projection matrix to further explore the potential key biological functions of each latent factor. Three-level gene expression features were generated as the baseline, including 196 BC risk genes identified by previous studies [36], [37], 182 pathway activities calculated using the Single Sample Gene Set Enrichment Analysis (ssGSEA) function [38] which was implemented in the GenePattern toolkit [39], and 6 commercialized BC gene signatures calculated using R package “genefu” [40]. We then performed pair-wise correlation analysis among all BTF features and all three-level gene expression features to see if they are correlated with each other.
Leveraging strategy for radiogenomic feature imputation
To perform a biomarker-orientated radiogenomic research, ideally, we need to have matched medical images, genomic profiles, and clinical outcomes measured on the same set of patients, in which we can first identify radiomic biomarkers associated with clinical outcomes (e.g., prognosis), then we can associate the radiomic biomarkers with patients’ genomic profiles to illustrate their biological mechanisms. However, in the majority of cases, we only have one or two sets of data sources measured on the same patients. To solve the challenge, we used a leverage strategy [41]. We first focused on the paired part of the radiogenomic dataset, where we have the paired genomic data and medical image data for 61 patients. we trained lasso models [42] to predict each radiomic feature using the genomic features x (2), (3).The 61 samples were randomly split to train set (43 samples) and test set (18 samples). Prediction performances were evaluated using Root Mean Square Error (RMSE). Then we turned to the unpaired part of the radiogenomic dataset, where we only have the genomic data and patients’ clinical outcomes without medical images for 762 patients. We applied the predictive models which were well-trained in the previous step to get the predicted radiomic features from the genomic features. In this way, we could get a completed paired dataset for further analysis. We also generated radiogenomic correlation map between the radiomic features and the genomic features to explore the potential biological explanations for the relationship of them.
Survival analysis for radiogenomic biomarker identification
We further applied the function “surv_cutpoint” and “surv_categorize” in the R package “survminer” [43] to select the optimized cut-off of each radiogenomic feature to categorize the patients into high and low-risk groups. For each radiogenomic feature, “survminer” looks for the cut-off where the log-rank test for survival analysis can produce the maximum statistic (lowest p-value). We classified the patients into the high-risk group and the low-risk group based on the cut-off for each radiogenomic feature. Then, we utilized the Kaplan-Meier (KM) plot to show the survival difference between the high-risk group and the low-risk group.
Mediation analysis
The complexity of DL models and their low reproducibility have weakened their applications in clinical practice [44]. Hence, to enhance the biological interpretation of the DLR biomarkers, mediation analysis [45] between the genomic features and the BC prognosis through the identified radiogenomic biomarkers are implemented to reason on these radiogenomic biomarkers both biologically and statistically. By testing and estimating the mediation effects of the identified radiogenomic biomarkers on the relationship between genomic features and patient survival, biological interpretation of these radiogenomic biomarkers could be well made. We first regressed the survival outcome variable against each genomic feature
(4). The effect is the total (direct and indirect) effect of the genomic feature on the survival outcome. Then, we regressed the mediator (identified radiogenomic biomarker) against
(5). The effect is the effect of a genomic feature on a mediator. Lastly, we regressed the survival outcome against both and
(6). is the indirect effect of the genomic feature on the survival outcome that goes through the radiogenomic biomarker. is the direct effect of the genomic feature on the survival outcome.These mediation analyses are done using R package “mediation” [46]. The significance of the estimated effects was tested and corrected using Benjamini-Hochberg multiple testing method [47].
Results
Radiomic and genomic features
The hyperparameter tunning for the 3DU-net could be found in Supplementary Table 1. The best hyperparameter combination was stride = 1, learning rate = 0.001, dropout ratio = 0.2. The segmentation performance DSC of the well-trained 3DU-net on the test set is 0.44. The explanation saliency maps are shown in Fig. 4 and 36 DLR features were extracted from the well-trained 3DU-net. According to the saliency maps, the important pixels fall into and around the tumor regions, which means our DL model made a certain segmentation decision mainly based on the tumor as well as surround-tumor regions. Using BTF algorithm, 17 multi-genomic factors were acquired. Their key biological functions identified by GSEA are shown in Table 2. These key functions range from cell division, blood vessel formation, immune response, intercellular signal transmitting, and so on, which quite fit the well-accepted cancer hallmark hypothesis [2]. This means that the multi-omics BTF method captures cancer heterogeneity very well. The pair-wise correlation analysis among radiomic features and genomic features could be found in Fig. 5. DLR features and BTF features are less redundant and could capture more information than SAR features and traditional gene expression features.
Fig. 4
Image data and explanation saliency visualization. The first column is the raw images. The second column is the predicted tumor segments. The third and fourth columns are two kinds of saliency map generated using gradient method and gradient*input method separately.
Table 2
Key enriched pathways for the multi-genomic Bayesian tensor factors.
The radiomic feature correlation analysis and genomic feature correlation analysis. A: The pairwise DLR feature correlations. Columns and rows are 32 DLR features. The darker colors represent the higher correlations. B: The pairwise SAR feature correlations. Columns and rows are 36 SAR features. The darker colors represent the higher correlations. As we can see, some of the SAR features are correlated with each other. C: The canonical correlations of the two radiomic feature matrices (DLR and SAR). The x-axis is the canonical dimensions, while the y-axis is the correlation of the correlations between the DLR features and SAR features in each dimension. It is telling us, these two feature matrices are highly correlated with each other, which also means, the DLR features are able to capture the majority of information that the SAR features captured. D: The scalar plot of the first two dimensions of DLR features and SAR features. Blue ones are the SAR features, while red ones are the DLR features. DLR features may capture more information than the SAR features because the red dots are more widely spread. E: The pairwise BTF multi-genomics feature correlations. Columns and rows are 17 BTF features. The darker colors represent the higher correlations. F: The pairwise three-level gene expression feature correlations. Columns and rows are 197 (risk gene expressions) + 182 (pathway activities) + 6 (gene signatures) = 385 gene expression features. The darker colors represent the higher correlations. According to the results, we could see that BTF features are more independent than the baseline three-level gene expression features. G: The canonical correlations of the two genomic feature matrices (BTF and three-level gene expression features). The x-axis is the canonical dimensions, while the y-axis is the correlation of the correlations between the BTF features and three-level gene expression features in each dimension. H: The scalar plot of the first two dimensions of BTF features and three-level gene expression features. Blue ones are the three-level gene expression features, while red ones are the BTF features. The BTF feature matrix and the three-level gene expression feature matrix are highly correlated with each other, which also means, the BTF features are able to capture the majority of information that the three-level gene expression features captured. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Image data and explanation saliency visualization. The first column is the raw images. The second column is the predicted tumor segments. The third and fourth columns are two kinds of saliency map generated using gradient method and gradient*input method separately.Key enriched pathways for the multi-genomic Bayesian tensor factors.The radiomic feature correlation analysis and genomic feature correlation analysis. A: The pairwise DLR feature correlations. Columns and rows are 32 DLR features. The darker colors represent the higher correlations. B: The pairwise SAR feature correlations. Columns and rows are 36 SAR features. The darker colors represent the higher correlations. As we can see, some of the SAR features are correlated with each other. C: The canonical correlations of the two radiomic feature matrices (DLR and SAR). The x-axis is the canonical dimensions, while the y-axis is the correlation of the correlations between the DLR features and SAR features in each dimension. It is telling us, these two feature matrices are highly correlated with each other, which also means, the DLR features are able to capture the majority of information that the SAR features captured. D: The scalar plot of the first two dimensions of DLR features and SAR features. Blue ones are the SAR features, while red ones are the DLR features. DLR features may capture more information than the SAR features because the red dots are more widely spread. E: The pairwise BTF multi-genomics feature correlations. Columns and rows are 17 BTF features. The darker colors represent the higher correlations. F: The pairwise three-level gene expression feature correlations. Columns and rows are 197 (risk gene expressions) + 182 (pathway activities) + 6 (gene signatures) = 385 gene expression features. The darker colors represent the higher correlations. According to the results, we could see that BTF features are more independent than the baseline three-level gene expression features. G: The canonical correlations of the two genomic feature matrices (BTF and three-level gene expression features). The x-axis is the canonical dimensions, while the y-axis is the correlation of the correlations between the BTF features and three-level gene expression features in each dimension. H: The scalar plot of the first two dimensions of BTF features and three-level gene expression features. Blue ones are the three-level gene expression features, while red ones are the BTF features. The BTF feature matrix and the three-level gene expression feature matrix are highly correlated with each other, which also means, the BTF features are able to capture the majority of information that the three-level gene expression features captured. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Leveraging strategy for prognostic radiogenomic biomarker identification
The root mean square error (RMSE) of the radiogenomic predictive lasso models could be found in Table 3 (for DLR feature prediction) and Supplementary table 2 (for SAR feature prediction). A lower RMSE means a better performance. As we can see, the multi-genomic BTF features perform overall better in predicting the DLR features than the baseline gene expression features. The radiogenomic correlation maps could be found in Supplementary Fig. 2. Twenty-three DLR features are significant in the survival analyses, which means we have identified 23 significant prognostic biomarkers using the proposed method and they have overall lower log-rank p-values than the SAR features (Fig. 6A). The KM plots of the most prognostically significant DLR biomarker (DLR_8) and SAR biomarker (Maximum enhancement) are shown in Fig. 6BC. Table 4 is showing the significant results of the mediation analyses. The most significant DLR biomarker (DLR_8) is a significant mediator of the BTF_4 (Natural killer cell mediated cytotoxicity)’s effect on patient survival.
Table 3
Performance of predictive LASSO models for each DLR feature.
Radiomic feature
Gene expression feature
BTF feature
RMSE
MAE
MAPE
RMSE
MAE
MAPE
DLR_1
83.08
68.11
0.4
57.22
35.87
0.75
DLR_2
71.6
47.31
0.26
49.81
27.36
0.76
DLR_3
65.16
53.33
0.82
21.08
17.65
0.1
DLR_4
60.55
49.51
0.71
22.72
16.18
0.09
DLR_5
61.45
46.26
0.3
16.64
11.69
0.07
DLR_6
60.13
44.88
0.25
28.14
24.48
0.15
DLR_7
53.99
40.35
0.24
50.74
35.39
0.43
DLR_8
54.66
44.39
0.25
53.93
40.58
1
DLR_9
107.26
87.58
4.36
80.61
60.65
3.52
DLR_10
98.83
79.18
2.85
25.76
25.51
0.13
DLR_11
45.66
34.04
0.23
12.01
12.01
0.06
DLR_12
24.76
17.26
0.12
30.44
13.04
0.12
DLR_13
27.85
18.32
0.13
30.63
13.29
0.13
DLR_14
27.52
18.57
0.13
33.67
20.68
0.16
DLR_15
104.52
81.16
4.08
69.06
50.32
2.26
DLR_16
89.73
65.58
2.42
53.86
33.54
1.74
DLR_17
102.34
85.94
0.51
44.35
36.26
0.3
DLR_18
77.37
60.19
0.49
47.01
31.37
0.59
DLR_19
58.2
46.47
0.77
27.88
26.51
0.15
DLR_20
62.56
42.71
0.22
22.39
17.25
0.11
DLR_21
37.11
27.53
0.15
32.35
26.22
0.19
DLR_22
64.53
51.65
0.3
25.29
18.7
0.09
DLR_23
50.77
37.48
0.24
29
23.1
0.15
DLR_24
57.25
45.85
0.27
52.08
33.93
1.14
DLR_25
63.32
49.08
1.14
70.53
38.77
2.57
DLR_26
22.65
17.78
0.1
12.98
9.12
0.05
DLR_27
21.42
12.33
0.09
4.84
4.84
0.02
DLR_28
32.88
16.05
0.17
3.03
3.03
0.02
DLR_29
7.35
5.67
0.03
0.01
0.01
0
DLR_30
17.17
13.57
0.07
2.79
2.79
0.01
DLR_31
60.98
44.31
0.95
17.77
16.69
0.09
DLR_32
94.82
76.64
2.78
53.41
41.07
1.59
Fig. 6
Prognostically significant DLR features and SAR features. A: The sorted p-values of survival analyses. DLR features showed overall lower p-values than SAR features. B: The most prognostically significant SAR feature. Maximum enhancement has the lowest p-value (2.45e-05) among all SAR features in the survival analyses. C: The most prognostically significant DLR feature. DLR-8 has the lowest p-value (5.97e-06) among all DLR features in the survival analyses.
Table 4
The significant results of mediation analyses of the identified biomarkers.
Independent variable
Mediator
ACME*
ACME_pvalue
ADE*
ADE_pvalue
TE*
TE_pvalue
Metabolism of xenobiotics by cytochrome
DLR_7
−0.03
0.044
0.21
<2e-16
0.18
0.004
BTF_4
DLR_8
30.38
0.05
−118.63
<2e-16
−88.25
0.026
BTF_7
DLR_2
9.00
0.046
−87.52
0.036
−78.53
0.05
*ACME (average causal mediation effects): indirect effect of the IV on the DV that goes through the mediator.
*ADE (average direct effects): direct effect of the IV on the DV.
*TE (total effect): direct and indirect effect of the IV on the DV.
Performance of predictive LASSO models for each DLR feature.Prognostically significant DLR features and SAR features. A: The sorted p-values of survival analyses. DLR features showed overall lower p-values than SAR features. B: The most prognostically significant SAR feature. Maximum enhancement has the lowest p-value (2.45e-05) among all SAR features in the survival analyses. C: The most prognostically significant DLR feature. DLR-8 has the lowest p-value (5.97e-06) among all DLR features in the survival analyses.The significant results of mediation analyses of the identified biomarkers.*ACME (average causal mediation effects): indirect effect of the IV on the DV that goes through the mediator.*ADE (average direct effects): direct effect of the IV on the DV.*TE (total effect): direct and indirect effect of the IV on the DV.
Discussion
Two advanced mathematical methods, BTF and DL, were used to estimate the multi-level genome and morphological heterogeneity of BC. BTF plus GSEA successfully provided us with biologically meaningful multi-genomic features. And their key biological functions are highly related to the known hallmarks of cancer [2], including signaling, cell cycle, metabolism, and immune related pathways. BTF features are more advanced than the single-source genomic features because they not only consider multiple genomic sources, but also consider the interaction between them. DL could extract image features automatically and objectively but its explainability needs to be increased. The proposed workflow increased the explainability of the DL-based image feature extraction in two ways, one is by adding two explanation tools into the model structure, the other is by introducing domain knowledges to support the extracted image features. According to our experiment, the DLR features performed better than the traditional SAR features, thus, we believe once the explainability issue is addressed, DL will have a bright future in healthcare data analyzing.Leveraging strategy is often seen in biomedical field [48], [49] because healthcare data is often not easy to get and thus will lead to the unpaired data problem. This is the first time that the leveraging strategy being introduced into DL-based radiogenomics. It successfully solved the unpaired data problem in this case. Taking advantage of the estimated radiogenomic features which representing the multi-level tumor heterogeneity, we successfully identified several prognostic biomarkers using the proposed workflow. The most prognostically significant radiogenomic biomarker has a potential intermediate effect on the causal relationship between the function of nature killer cells and patient’s survival time. The identified BC prognostic radiogenomic biomarkers are non-invasive and effectively representing both medical imaging and multi-genomic information. They are clinically more feasible because they could be obtained from medical images only, no need to perform the invasive biopsy.In conclusion, we provided a comprehensive radiogenomic workflow which could overcome major difficulties of current radiogenomic studies. Our experiments showed that the proposed workflow could identify non-invasive, objective, automatic, integrated, and explainable BC radiogenomic biomarkers with great prognostic significance comparing with the baselines. Our results also uncover genetic mechanisms regulating clinical phenotypes. Such mechanisms could promote medical imaging as a non-invasive examination of probing BC molecular status, then support clinical decisions and ultimately improve patient care.
Funding
This work was supported in part by and . P.H. is the holder of Manitoba Medical Services Foundation (MMSF) Allen Rouse Basic Science Career Development Research Award.
Authors: Olivier Gevaert; Jiajing Xu; Chuong D Hoang; Ann N Leung; Yue Xu; Andrew Quon; Daniel L Rubin; Sandy Napel; Sylvia K Plevritis Journal: Radiology Date: 2012-06-21 Impact factor: 11.105
Authors: Kenneth Clark; Bruce Vendt; Kirk Smith; John Freymann; Justin Kirby; Paul Koppel; Stephen Moore; Stanley Phillips; David Maffitt; Michael Pringle; Lawrence Tarbox; Fred Prior Journal: J Digit Imaging Date: 2013-12 Impact factor: 4.056
Authors: Bo Wang; Aziz M Mezlini; Feyyaz Demir; Marc Fiume; Zhuowen Tu; Michael Brudno; Benjamin Haibe-Kains; Anna Goldenberg Journal: Nat Methods Date: 2014-01-26 Impact factor: 28.547
Authors: Joost J M van Griethuysen; Andriy Fedorov; Chintan Parmar; Ahmed Hosny; Nicole Aucoin; Vivek Narayan; Regina G H Beets-Tan; Jean-Christophe Fillion-Robin; Steve Pieper; Hugo J W L Aerts Journal: Cancer Res Date: 2017-11-01 Impact factor: 12.701
Authors: Joseph S Baxter; Olivia C Leavy; Nicola H Dryden; Sarah Maguire; Nichola Johnson; Vita Fedele; Nikiana Simigdala; Lesley-Ann Martin; Simon Andrews; Steven W Wingett; Ioannis Assiotis; Kerry Fenwick; Ritika Chauhan; Alistair G Rust; Nick Orr; Frank Dudbridge; Syed Haider; Olivia Fletcher Journal: Nat Commun Date: 2018-03-12 Impact factor: 14.919