Mateusz Garbulowski1,2, Karolina Smolinska1, Uğur Çabuk1,3,4, Sara A Yones1, Ludovica Celli1,5,6, Esma Nur Yaz1,7, Fredrik Barrenäs1,8, Klev Diamanti1,9, Claes Wadelius9, Jan Komorowski1,8,10,11.
Abstract
Gliomas develop and grow in the brain and central nervous system. Examining glioma grading processes is valuable for addressing therapeutic challenges. One of the most extensive repositories of transcriptomics data for gliomas is The Cancer Genome Atlas (TCGA). However, such large cohorts should be processed with caution and evaluated thoroughly, as they can contain batch and other effects. Furthermore, the biological mechanisms of cancer involve interactions among biomarkers. We therefore applied an interpretable machine learning approach to discover such relationships. This type of transparent learning not only provides good predictability but also reveals co-predictive mechanisms among features. In this study, we corrected the strong and confounded batch effect in the TCGA glioma data. We then used the corrected datasets to perform a comprehensive machine learning analysis of single-sample gene set enrichment scores, using collections from the Molecular Signatures Database. Using rule-based classifiers, we displayed networks of co-enrichment related to glioma grades, and we validated our results using external glioma cohorts. We believe that utilizing corrected glioma cohorts from TCGA may improve the application and validation of future studies. Finally, the co-enrichment and survival analyses provided detailed explanations for glioma progression and should consequently support targeted treatment.
Keywords: TCGA; batch effect; co-enrichment; glioma; machine learning; rough sets
Year: 2022 PMID: 35205761 PMCID: PMC8870250 DOI: 10.3390/cancers14041014
Source DB: PubMed Journal: Cancers (Basel) ISSN: 2072-6694 Impact factor: 6.639
Figure 1. Overview of the pipeline applied in this work. Preprocessing was followed by two separate analyses: the lower tier of the pipeline illustrates the steps employed for the identification and basic analysis of DEGs, while the upper tier shows the steps for the ssGSEA analysis based on ML approaches using MSigDB collections. The final step illustrates the validation of the results.
A summary of the sample counts used in the analysis to obtain and validate results. In total, 1671 publicly available samples were used in this analysis.
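| Cohort | Subset | Grade | Samples |
|---|---|---|---|
| TCGA | GM1 | GII | 231 |
| TCGA | GM1 | GIII | 168 |
| TCGA | GM2 | LGG | 108 |
| TCGA | GM2 | GBM | 151 |
| CGGA | Batch 1 | GII | 188 |
| CGGA | Batch 1 | GIII | 255 |
| CGGA | Batch 1 | GBM | 249 |
| CGGA | Batch 2 | GII | 103 |
| CGGA | Batch 2 | GIII | 79 |
| CGGA | Batch 2 | GBM | 139 |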
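
The stated total of 1671 samples can be checked against the per-group counts in the table. The snippet below is only a small arithmetic sketch; the variable names are illustrative, and the assignment of columns to the GM1/GM2 and Batch 1/Batch 2 subsets follows the table's header order, which is an assumption about the original layout.

```python
# Sample counts copied from the table: (cohort, subset, grade) -> samples.
# The column-to-subset mapping is an assumption based on the header order.
sample_counts = {
    ("TCGA", "GM1", "GII"): 231,
    ("TCGA", "GM1", "GIII"): 168,
    ("TCGA", "GM2", "LGG"): 108,
    ("TCGA", "GM2", "GBM"): 151,
    ("CGGA", "Batch 1", "GII"): 188,
    ("CGGA", "Batch 1", "GIII"): 255,
    ("CGGA", "Batch 1", "GBM"): 249,
    ("CGGA", "Batch 2", "GII"): 103,
    ("CGGA", "Batch 2", "GIII"): 79,
    ("CGGA", "Batch 2", "GBM"): 139,
}

# Grand total across all groups.
total = sum(sample_counts.values())

# Subtotals per cohort.
per_cohort = {}
for (cohort, _subset, _grade), n in sample_counts.items():
    per_cohort[cohort] = per_cohort.get(cohort, 0) + n

print(total)       # 1671
print(per_cohort)  # {'TCGA': 658, 'CGGA': 1013}
```

The per-group counts sum to the 1671 samples reported in the caption, with 658 from TCGA and 1013 from CGGA.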
The results of DEG list validation with the CGGA cohorts. p-values in the validation cohorts were FDR-adjusted.
| Cohort | Comparison | | |
|---|---|---|---|
| CGGA Batch 1 | GII vs. GIII | 62% | 88% |
| CGGA Batch 1 | LGG vs. GBM | 27% | 44% |
| CGGA Batch 2 | GII vs. GIII | 85% | 96% |
| CGGA Batch 2 | LGG vs. GBM | 52% | 71% |
Figure 2. Evaluation of MSigDB collections using ML models to discern glioma grades from ssGSEA scores, classifying (A) GII vs. GIII and (B) LGG vs. GBM. Five different ML approaches were used: SMO, IBk, Bagging, J48 and JRip. For each ML method, undersampling was repeated 20 times with 10-fold CV. The median is marked with a black dot on each violin.
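The evaluation scheme in the caption (undersampling repeated 20 times, each with 10-fold CV) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' setup: the data are synthetic, the helper names are hypothetical, and a plain 1-nearest-neighbour classifier stands in loosely for IBk (Weka's k-NN learner).

```python
import random
from statistics import median

def undersample(labels, rng):
    """Return sample indices with every class randomly cut to the minority size."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(ix) for ix in by_class.values())
    picked = []
    for ix in by_class.values():
        picked.extend(rng.sample(ix, n_min))
    return picked

def stratified_folds(indices, labels, k, rng):
    """Split indices into k folds, dealing each class out round-robin."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    for ix in by_class.values():
        rng.shuffle(ix)
        for pos, i in enumerate(ix):
            folds[pos % k].append(i)
    return folds

def predict_1nn(train_idx, x, labels, query):
    """Label of the training sample closest to query (squared Euclidean distance)."""
    nearest = min(train_idx,
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(x[i], query)))
    return labels[nearest]

def repeated_cv_accuracy(x, labels, repeats=20, k=10, seed=0):
    """One accuracy per undersampling repeat, each from a k-fold CV."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        balanced = undersample(labels, rng)
        folds = stratified_folds(balanced, labels, k, rng)
        correct = 0
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            correct += sum(predict_1nn(train_idx, x, labels, x[i]) == labels[i]
                           for i in test_idx)
        accuracies.append(correct / len(balanced))
    return accuracies

# Synthetic, imbalanced toy data standing in for ssGSEA score vectors.
toy_rng = random.Random(1)
x = ([[toy_rng.gauss(0, 1), toy_rng.gauss(0, 1)] for _ in range(60)]
     + [[toy_rng.gauss(3, 1), toy_rng.gauss(3, 1)] for _ in range(40)])
labels = ["GII"] * 60 + ["GIII"] * 40

accs = repeated_cv_accuracy(x, labels)
print(len(accs), median(accs))
```

Each of the 20 accuracies would correspond to one point in a violin for a given method and collection; the median of the list plays the role of the black dot.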
Figure 3. (A,B) Evaluation of MSigDB collections using ML models with feature selection to discern glioma grades from ssGSEA scores. Five different ML approaches were used: SMO, IBk, Bagging, J48 and JRip. For each ML method, undersampling was repeated 20 times with 10-fold CV. The median is marked with a black dot on each violin. Panels (C,E,G) represent MCFS results for the top three collections for GII vs. GIII. Panels (D,F,H) represent MCFS results for the top three collections for LGG vs. GBM. In panels (C–H), the size of annotations represents RI values from MCFS.
Figure 4. Rule-based networks displaying the most relevant co-enrichments of annotations obtained from (A,B) the CGP collection for the GII vs. GIII model (Table S7) and from (C,D) the GOCC collection for the LGG vs. GBM model (Table S8). The networks show the 20 most connected nodes obtained from the top 10% of significant rules (FDR-adjusted p-value < 0.01) based on the rule connection. Connection values of nodes and edges represent the strength of co-enrichment from the classifier. Subnetworks were generated separately with respect to the decision class for each RBM.
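The selection described in the caption (FDR-adjusted p-value < 0.01, top 10% of rules by connection, 20 most connected nodes) can be sketched roughly as below. Everything here is an assumption for illustration: the rule records, the `connection` field and the function names are hypothetical, the Benjamini-Hochberg procedure is assumed as the FDR method, and node connectivity is approximated by summing the connection values of the rules in which two annotations co-occur.

```python
from collections import defaultdict
from itertools import combinations
from math import ceil

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest to smallest p
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def rule_network(rules, alpha=0.01, top_fraction=0.10, max_nodes=20):
    """Edge weights between annotations that co-occur in the selected rules."""
    adjusted = benjamini_hochberg([r["p"] for r in rules])
    significant = [r for r, q in zip(rules, adjusted) if q < alpha]
    significant.sort(key=lambda r: r["connection"], reverse=True)
    n_keep = ceil(len(significant) * top_fraction)
    edges = defaultdict(float)
    strength = defaultdict(float)
    for rule in significant[:n_keep]:
        for a, b in combinations(sorted(rule["features"]), 2):
            edges[(a, b)] += rule["connection"]
            strength[a] += rule["connection"]
            strength[b] += rule["connection"]
    keep = set(sorted(strength, key=strength.get, reverse=True)[:max_nodes])
    return {e: w for e, w in edges.items() if e[0] in keep and e[1] in keep}

# Hypothetical toy rules; annotation names and values are made up.
rules = [
    {"features": ("SET_A", "SET_B"), "p": 1e-6, "connection": 0.9},
    {"features": ("SET_B", "SET_C"), "p": 1e-5, "connection": 0.7},
    {"features": ("SET_A", "SET_C", "SET_D"), "p": 1e-4, "connection": 0.4},
    {"features": ("SET_D", "SET_E"), "p": 0.2, "connection": 0.95},
]

# A larger fraction than 10% is used only because the toy rule set is tiny.
network = rule_network(rules, top_fraction=0.5)
print(network)
```

In this toy run the rule with p = 0.2 is dropped by the significance filter despite its high connection value, which is the point of filtering before ranking.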
Figure 5. Survival curves of several NOIs characterized based on rule networks. We investigated NOIs for the three most predictive MSigDB collections discerning glioma grades: (A–D) CGP and GOCC, (E–H) BioCarta and GOBP, and (I–L) PID and WP. Each plot displays a p-value that was estimated with the default set of parameters used while constructing the curves (Table S1).
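For readers unfamiliar with how such curves are built, the sketch below computes a survival curve with the Kaplan-Meier product-limit estimator. This is an assumption for illustration: the caption does not name the estimator or the software used, the toy follow-up times are made up, and the p-value shown in each plot (a group comparison) is not reproduced here.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate.

    times  : follow-up time per sample
    events : 1 if the event (death) was observed, 0 if the sample was censored
    Returns a list of (time, survival probability) at each observed event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather all samples sharing this time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored samples leave the risk set.
        at_risk -= removed
    return curve

# Hypothetical follow-up data (time in months) for one group of samples.
toy_times = [5, 8, 8, 12, 20]
toy_events = [1, 1, 0, 1, 0]

for t, s in kaplan_meier(toy_times, toy_events):
    print(t, round(s, 3))
```

Censored samples lower the number at risk without producing a step in the curve, which is why the sample censored at time 8 and the one censored at time 20 add no drop of their own.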