Petr V Nazarov, Anke K Wienecke-Baldacchino, Andrei Zinovyev, Urszula Czerwińska, Arnaud Muller, Dorothée Nashan, Gunnar Dittmar, Francisco Azuaje, Stephanie Kreis.
Abstract
BACKGROUND: The amount of publicly available cancer-related "omics" data is constantly growing and can potentially be used to gain insights into the tumour biology of new cancer patients, their diagnosis and suitable treatment options. However, the integration of different datasets is not straightforward and requires specialized approaches to deal with heterogeneity at technical and biological levels.
Keywords: Cancer; Deconvolution; Independent component analysis; Survival analysis; Transcriptomics
Year: 2019 PMID: 31533822 PMCID: PMC6751789 DOI: 10.1186/s12920-019-0578-4
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Fig. 1 Visualization of the data analysis approach. A large discovery dataset and a small investigation dataset from patients (both mRNA) were concatenated and analysed together by ICA. As a result, two matrices were obtained: the matrix of metagenes, containing the contribution of the genes to each component, and the matrix of metasamples, presenting the weights of the components in the samples. The metagene matrix provides gene signatures for each of the components, which could be linked to cellular processes by standard functional annotation or enrichment analysis. The metasample matrix can be linked to clinical data and used to predict classes of new patients and their survival
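The decomposition described in Fig. 1 can be illustrated with a minimal sketch. It uses scikit-learn's `FastICA` on synthetic data as a stand-in and is not the authors' implementation. The dataset sizes, the simulated expression values and the names `metagenes`, `metasamples` and `investigation_weights` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Hypothetical sizes: genes, discovery samples, investigation samples, components.
n_genes, n_discovery, n_investigation, n_components = 500, 40, 5, 3
n_samples = n_discovery + n_investigation

# Synthetic expression matrix (genes x samples) built from non-Gaussian signals.
true_signals = rng.laplace(size=(n_genes, n_components))
true_weights = rng.normal(size=(n_components, n_samples))
expression = true_signals @ true_weights + 0.05 * rng.normal(size=(n_genes, n_samples))

discovery = expression[:, :n_discovery]
investigation = expression[:, n_discovery:]

# Concatenate both datasets along the sample axis and decompose them together.
combined = np.hstack([discovery, investigation])

# Genes play the role of observations, so the estimated sources are gene-level signals.
ica = FastICA(n_components=n_components, random_state=0, max_iter=1000)
metagenes = ica.fit_transform(combined)  # genes x components
metasamples = ica.mixing_.T              # components x samples

# Component weights of the investigation samples, e.g. as input to a classifier.
investigation_weights = metasamples[:, n_discovery:]
```

Because both datasets are decomposed jointly, the weights of the investigation samples live in the same component space as those of the discovery samples and can be compared with them directly.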
Fig. 2 Data overview in the space defined by principal and independent components. Data variability captured by the first components of PCA (a) and two selected components of ICA (b) in gene expression data. Independent components were selected based on the predictive power of their weights for patient gender (RIC3) and sample type (RIC5). miRNA data showed an even greater discrepancy between miRNA-seq and qPCR results in PCA (c). However, in the space of independent components (MIC1 and MIC9), the samples studied by miRNA-seq and qPCR overlap (d)
Performance of ICA-based feature extraction. Mean values of sensitivity and specificity are reported, as well as the class probability derived from random forest voting
| Predicted variable | Groups | Accuracy (st. dev.) | Sensitivity, specificity | P2PM (prob.) | P4PM (prob.) | P6PM (prob.) | P4NS (prob.) | NHEM (prob.) |
|---|---|---|---|---|---|---|---|---|
| Gender | female: 179, male: 293 | 0.996 (< 0.001) | 0.994, 0.994 | female (0.73) | female (0.66) | male (0.79) | female (0.68) | female (0.67) |
| Sample type | primary: 105, metastatic: 367 | 0.871 (0.003) | 0.733, 0.733 | primary (0.68) | primary (0.55) | primary (0.65) | primary (0.59) | metastatic (0.51) |
| Subtype (RNA cluster) | immune: 170, keratin: 102, MITF-low: 59 | 0.902 (0.006) | 0.877, 0.945 | keratin (0.64) | keratin (0.48) | keratin (0.61) | keratin (0.64) | keratin (0.55) |
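The quantities reported in the table can be made concrete with a short sketch. It assumes that the class probability is the fraction of trees voting for the winning class, and the labels and votes are invented for illustration. The helper names `binary_metrics` and `vote_probability` are hypothetical.

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive):
    """Accuracy, sensitivity and specificity for a two-class problem."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    return {
        "accuracy": (tp + tn) / y_true.size,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

def vote_probability(tree_votes):
    """Winning class and the fraction of trees that voted for it."""
    labels, counts = np.unique(np.asarray(tree_votes), return_counts=True)
    best = np.argmax(counts)
    return str(labels[best]), counts[best] / counts.sum()

# Invented example: five samples classified as primary or metastatic.
y_true = ["primary", "primary", "metastatic", "metastatic", "metastatic"]
y_pred = ["primary", "metastatic", "metastatic", "metastatic", "primary"]
metrics = binary_metrics(y_true, y_pred, positive="primary")

# Invented example: ten trees voting on the gender of one new sample.
predicted_class, probability = vote_probability(["female"] * 7 + ["male"] * 3)
```

A probability close to 0.5, as for several investigation samples in the table, indicates that the trees were nearly split and the call should be treated with caution.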
Fig. 3 Benchmarking of ICA and other dimensionality reduction methods. Accuracies for classifying patients by gender (a), sample type (b) and tumour subtype (c) were compared using 8 distinct methods. PCA was applied to the original data (PCA), as well as to the data corrected using ComBat (PCA_ComBat) and XPN (PCA_XPN). The presented tools are described in the Methods section
Fig. 4 ICA-based risk score (RS) can predict patient survival. Performance of the risk score on the TCGA discovery patient cohort (a). Validation of the risk score on the independent cohort composed of 44 metastatic melanoma patients (b). Cox regression log hazard ratio (LHR) together with its 95% CI and log-rank p-value are reported. To visualize the results as Kaplan-Meier curves, patients were divided into two groups by their RS (low risk – blue and high risk – red)
Fig. 5 Correlated component clusters. Heatmaps showing the coefficient of determination (r²) between weights of RIC-RIC (a), MIC-MIC (b) and RIC-MIC (c). The cluster of components (d) is based on gene components (RICs) linked to immune response via enrichment analysis of top-contributing genes; cluster (e) is based on RICs linked to angiogenesis and stroma transcriptional signal. The size of the circles illustrates the number of top-contributing genes and miRNAs in the components. RIC and MIC components have been linked to each other on the basis of correlation (edges between components show r² > 0.25). As an additional validation, the weights of the described components were compared with ESTIMATE [9] scores and the corresponding r² values are shown in (f). The weights of the RIC25 and RIC13 components correlated best with immune and stromal scores, shown in (g)
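The linking of components by correlation in Fig. 5 can be sketched as follows, assuming that r² is the squared Pearson correlation between component weights across the same samples. The weight matrices are synthetic and the names `ric_weights`, `mic_weights` and `edges` are hypothetical. Only the 0.25 threshold comes from the caption.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 200

# Synthetic weights: 3 gene components (RIC) and 2 miRNA components (MIC),
# each row holding one component's weights across the same samples.
ric_weights = rng.normal(size=(3, n_samples))
mic_weights = rng.normal(size=(2, n_samples))

# Make the first MIC track the second RIC so that one edge is expected.
mic_weights[0] = ric_weights[1] + 0.1 * rng.normal(size=n_samples)

# Pearson correlation between all rows; keep only the RIC-vs-MIC block.
n_ric = ric_weights.shape[0]
correlation = np.corrcoef(ric_weights, mic_weights)
r_squared = correlation[:n_ric, n_ric:] ** 2  # shape: (n_ric, n_mic)

# Draw an edge between components whose weights share r² > 0.25.
edges = [(int(i), int(j)) for i, j in np.argwhere(r_squared > 0.25)]
```

Applying the same computation to `ric_weights` alone or `mic_weights` alone would give the within-type matrices corresponding to panels (a) and (b).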
Fig. 6 Biologically relevant components and their ranked weights in the investigation dataset. Sample ranks are calculated relative to the TCGA discovery set (red: weight above the median in TCGA samples; blue: below the median)