Sarah Mubeen1,2, Charles Tapley Hoyt1,2, André Gemünd1, Martin Hofmann-Apitius1,2, Holger Fröhlich2, Daniel Domingo-Fernández1,2.
Abstract
Pathway-centric approaches are widely used to interpret and contextualize -omics data. However, databases contain different representations of the same biological pathway, which may lead to different results of statistical enrichment analysis and predictive models in the context of precision medicine. We have performed an in-depth benchmarking of the impact of pathway database choice on statistical enrichment analysis and predictive modeling. We analyzed five cancer datasets using three major pathway databases and developed an approach to merge several databases into a single integrative one: MPath. Our results show that equivalent pathways from different databases yield disparate results in statistical enrichment analysis. Moreover, we observed a significant dataset-dependent impact on the performance of machine learning models on different prediction tasks. In some cases, MPath significantly improved prediction performance and also reduced the variance of prediction performances. Furthermore, MPath yielded more consistent and biologically plausible results in statistical enrichment analyses. In summary, this benchmarking study demonstrates that pathway database choice can influence the results of statistical enrichment analysis and predictive modeling. Therefore, we recommend the use of multiple pathway databases or integrative ones.
Keywords: benchmarking; databases; machine learning; pathway enrichment; statistical hypothesis testing
Year: 2019 PMID: 31824580 PMCID: PMC6883970 DOI: 10.3389/fgene.2019.01203
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Number of publications citing major pathway resources for pathway enrichment in PubMed Central (PMC), 2019. To develop an estimate on the number of publications using several pathway databases for pathway enrichment, SCAIView (http://academia.scaiview.com/academia; indexed on 01/03/2019) was used to conduct the following query using the PMC corpus: “
| Type | Pathway resource | Publications |
|---|---|---|
| | | 27,713 |
| | | 3,765 |
| | | 651 |
| | | 2,892 |
| | | 339 |
| | | 1,640 |
Figure 1. Schema illustrating the generation of MPath. (A) The curated pathway mapping catalog, which links equivalent pathways from different resources. Pathways that are shared across two resources are referred to as pathway analogs (e.g., Pathway A in Reactome and Pathway A′ in KEGG), and pathways that are shared across all three resources are referred to as “super pathways” (e.g., Pathway A in KEGG, Pathway A′ in Reactome, and Pathway A″ in WikiPathways). (B) Using these mappings, gene sets of equivalent pathways from different resources can be combined, ensuring that key molecular players from the different resources are included. (C) Similarly, network representations of the pathways can be overlaid to generate more comprehensive pathways. (D) Finally, both the combined gene sets and the network representations are included in MPath. Note that pathways that are exclusive to a single database are included in MPath unchanged.
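The gene set merging step described in panel (B) can be illustrated with a minimal sketch. It assumes that gene sets are keyed by a hypothetical `(database, pathway_id)` tuple and that the mapping catalog is a list of pairs of such keys; neither the data layout nor the function name comes from the MPath implementation. Pairwise mappings are chained with a union-find structure, so a pathway linked across all three resources collapses into a single group.

```python
from collections import defaultdict


def merge_equivalent_pathways(gene_sets, mappings):
    """Union the gene sets of pathways linked by equivalence mappings.

    gene_sets: dict mapping (database, pathway_id) -> set of gene symbols
    mappings:  iterable of pairs of keys from gene_sets judged equivalent
    Returns a list of (member_keys, merged_genes) tuples.
    """
    # Union-find over pathway keys, so that chained pairwise mappings
    # (KEGG~Reactome and Reactome~WikiPathways) end up in one group.
    parent = {key: key for key in gene_sets}

    def find(key):
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    for a, b in mappings:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for key in gene_sets:
        groups[find(key)].append(key)

    merged = []
    for members in groups.values():
        genes = set().union(*(gene_sets[m] for m in members))
        merged.append((sorted(members), genes))
    return merged


# Toy data: identifiers and gene lists are illustrative only.
toy_gene_sets = {
    ("kegg", "A"): {"TP53", "MDM2"},
    ("reactome", "A'"): {"TP53", "CDKN1A"},
    ("wikipathways", "A''"): {"TP53", "ATM"},
    ("kegg", "B"): {"EGFR"},  # exclusive to one database, kept unchanged
}
toy_mappings = [
    (("kegg", "A"), ("reactome", "A'")),
    (("reactome", "A'"), ("wikipathways", "A''")),
]
for members, genes in merge_equivalent_pathways(toy_gene_sets, toy_mappings):
    print(members, sorted(genes))
```

A pathway with no mapping forms a group of one, which matches the caption's note that database-exclusive pathways enter the merged resource unchanged.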
Statistics of the five TCGA cancer datasets used in this work.
| Cancer type | TCGA abbreviation | Tumor samples | Normal samples | Surviving patients | Deceased patients |
|---|---|---|---|---|---|
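| Breast invasive carcinoma | BRCA | 1,102 | 113 | 946 | 153 |
| Kidney renal clear cell carcinoma | KIRC | 538 | 72 | 365 | 173 |
| Liver hepatocellular carcinoma | LIHC | 371 | 50 | 240 | 130 |
| Prostate adenocarcinoma | PRAD | 498 | 52 | 498 | 10 |
| Ovarian serous cystadenocarcinoma | OV | 374 | 0 | 143 | 229 |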
The statistics correspond to those retrieved from the GDC portal and cBioportal on 14-03-2019. Longitudinal statistics of survival data are presented in .
Figure 2. Design of the benchmarking schema. The influence of alternative pathway databases on the results of statistical pathway enrichment (left) and machine learning classification tasks (right) is compared.
A qualitative description of the computational costs of the analyses performed.
| Task | Relative memory usage | Timescale |
|---|---|---|
| ORA | Low | Seconds |
| GSEA | Medium | Minutes |
| ssGSEA | Very high | Hours |
| Prediction of tumor vs. normal | Medium | Minutes |
| Prediction of known tumor subtype | Medium | Minutes |
| Prediction of overall survival | Medium | Hours |
Performing ssGSEA required on the order of 100 GB of RAM for some dataset/database combinations, while the other tasks could be run on a modern laptop without issues.
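To make the "Low / Seconds" entry for ORA concrete: ORA is commonly computed as a one-sided hypergeometric (Fisher's exact) test per pathway, which needs only set intersections and a few binomial coefficients. The sketch below shows that formulation in plain Python. The function name and arguments are hypothetical, and whether the study used exactly this variant (for example, this choice of background gene universe) is an assumption.

```python
from math import comb


def ora_p_value(query_genes, pathway_genes, background_genes):
    """One-sided hypergeometric p value for over-representation.

    query_genes:      genes of interest (e.g. differentially expressed)
    pathway_genes:    genes annotated to the pathway
    background_genes: the gene universe; other inputs are restricted to it
    """
    background = set(background_genes)
    query = set(query_genes) & background
    pathway = set(pathway_genes) & background

    N = len(background)        # population size
    K = len(pathway)           # pathway genes in the population
    n = len(query)             # number of genes drawn
    k = len(query & pathway)   # observed overlap

    # P(X >= k) for X ~ Hypergeometric(N, K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


# Toy example: all 3 query genes fall in a 3-gene pathway, 10-gene universe.
universe = [f"g{i}" for i in range(10)]
print(ora_p_value(["g0", "g1", "g2"], ["g0", "g1", "g2"], universe))
```

In practice the p values from all tested pathways would also be corrected for multiple testing, which this sketch omits.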
Figure 3. Left: distribution of raw p values of pathway analogs across databases [top to bottom: overrepresentation analysis (ORA), gene set enrichment analysis (GSEA), and signaling pathway impact analysis (SPIA)]. Right: significance of average rank differences of pathway analogs across pairwise database comparisons for the given method.
Figure 4. Comparison of prediction performance of an elastic net classifier (tumor vs. normal) using single-sample gene set enrichment analysis (ssGSEA)-based pathway activity profiles computed from different resources. Each box plot shows the distribution of the area under the ROC curves (AUCs) over 10 repeats of the 10-fold cross-validation procedure.
Figure 5. Comparison of prediction performance of an elastic net classifier (BRCA and PRAD subtypes) using single-sample gene set enrichment analysis (ssGSEA)-based pathway activity profiles computed from different resources. Each box plot shows the distribution of the area under the ROC curves (AUCs) over 10 repeats of the 10-fold cross-validation procedure.
Figure 6. Comparison of prediction performance of an elastic net penalized Cox regression model (overall survival) using single-sample gene set enrichment analysis (ssGSEA)-based pathway activity profiles computed from different resources. Each box plot shows the distribution of the area under the ROC curves (AUCs) over 10 repeats of the 10-fold cross-validation procedure.
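The evaluation scheme named in the captions of Figures 4 to 6, AUC values collected over 10 repeats of 10-fold cross-validation, can be sketched without any machine learning library. The two helpers below are hypothetical and cover only the binary classification case: `roc_auc` uses the rank interpretation of the AUC, and `repeated_kfold` produces the index splits. The splits here are not stratified, and the model fitting itself (elastic net on ssGSEA profiles) is left out; both are simplifying assumptions rather than a description of the authors' pipeline.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative.

    labels: iterable of 0/1 class labels
    scores: iterable of predicted scores; tied pairs count as 0.5
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_kfold(n_samples, n_splits=10, n_repeats=10, seed=0):
    """Yield (repeat, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh shuffle per repeat
        for fold in range(n_splits):
            test = order[fold::n_splits]  # every n_splits-th sample
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield repeat, train, test


# 10 x 10 splits give 100 train/test pairs; with a model fitted on each
# training set, roc_auc on the test fold yields the values in a box plot.
splits = list(repeated_kfold(n_samples=50))
print(len(splits))
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

With strongly imbalanced classes, an unstratified test fold may contain a single class, in which case `roc_auc` raises an error; a stratified split would avoid that.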