Sarah Mubeen, Alpha Tom Kodamullil, Martin Hofmann-Apitius, Daniel Domingo-Fernández.
Abstract
Pathway enrichment analysis has become a widely used knowledge-based approach for the interpretation of biomedical data. Its popularity has led to an explosion of both enrichment methods and pathway databases. While the elegance of pathway enrichment lies in its simplicity, multiple factors can impact the results of such an analysis, and these may not be accounted for. Researchers may fail to give influential aspects their due, resorting instead to popular methods and gene set collections, or default settings. Despite ongoing efforts to establish set guidelines, meaningful results are still hampered by a lack of consensus or gold standards around how enrichment analysis should be conducted. Nonetheless, such concerns have prompted a series of benchmark studies specifically focused on evaluating the influence of various factors on pathway enrichment results. In this review, we organize and summarize the findings of these benchmarks to provide a comprehensive overview of the influence of these factors. Our work covers a broad spectrum of factors, spanning from methodological assumptions to those related to prior biological knowledge, such as pathway definitions and database choice. In doing so, we aim to shed light on how these aspects can lead to insignificant, uninteresting or even contradictory results. Finally, we conclude the review by proposing future benchmarks as well as solutions to overcome some of the challenges that originate from the outlined factors.
Keywords: benchmark; gene set analysis; gene set collection; omics data; pathway database; pathway enrichment
Year: 2022 PMID: 35453140 PMCID: PMC9116215 DOI: 10.1093/bib/bbac143
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 13.994
Figure 1. Illustration of major factors that influence the results of pathway enrichment analysis discussed in this review. The height and color of the bars are symbolic and do not correlate with importance. The two networks depicted above represent two biological pathways mapped to gene expression data (matrix below).
Comparative studies evaluating differences across enrichment methods
| No. | Review | Methods tested | Datasets | Database (# of gene sets/pathways) | Types of evaluated methods |
|---|---|---|---|---|---|
| 1 | [ | 7 | 36 | KEGG (116) | Topology- and non-topology-based methods |
| 2 | [ | 10 | 75 | KEGG (323) and GO (4631) | ORA and FCS methods |
| 3 | [ | 7 | 118 | KEGG (232) | Topology-based methods |
| 4 | [ | 6 | 20 | KEGG (86) | Topology- and non-topology-based methods |
| 5 | [ | 9 | 3 | KEGG (114) | Topology-based methods |
| 6 | [ | 13 | 6 | GO gene set collection extracted from MSigDB [ | Widely used pathway enrichment methods |
| 7 | [ | 8 | 3 | MSigDB v5.0 (10,295) | Widely used pathway enrichment methods |
| 8 | [ | 10 | 86 | KEGG; 150 pathways for all methods except 130 for PathNet [ | Topology- and non-topology-based methods |
| 9 | [ | 11 | 1 | C2 collection from MSigDB v4.0 (4722) | Methods differing based on null hypothesis |
| 10 | [ | 16 | 42 | KEGG (259) and Metacore™ (88) | ORA and FCS methods |
| 11 | [ | 5 | 6 | KEGG (192) | ORA and FCS methods |
| 12 | [ | 7 | 38 | KEGG (189) | ORA and FCS methods |
The third column reports the number of enrichment methods compared in each study (see Supplementary Tables 2 and 3, available online at https://academic.oup.com/bib, for details on the methods tested). We differentiate between methods and tools/web applications following Geistlinger et al. [2]. The fourth column reports the number of datasets on which each study performed comparisons; all were experimental datasets, except in [3, 13, 14, 18, 22], which included both experimental and simulated datasets. Finally, the fifth column reports the pathway databases used in each study, with the number of pathways shown in parentheses.
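
Several of the benchmarks in the table compare over-representation analysis (ORA) methods. As a point of orientation, ORA is commonly formulated as a one-sided hypergeometric test on the overlap between a list of selected genes and a pathway. The following is a minimal sketch of that general technique, not the implementation used in any of the benchmarked studies; the function name `ora_p_value` and all gene counts are hypothetical and chosen only for illustration.

```python
from math import comb


def ora_p_value(n_background: int, n_pathway: int,
                n_selected: int, n_overlap: int) -> float:
    """One-sided hypergeometric p-value, P(X >= n_overlap).

    n_background: number of genes in the background (universe)
    n_pathway:    number of background genes annotated to the pathway
    n_selected:   number of genes in the list of interest
    n_overlap:    number of selected genes that fall in the pathway
    """
    # The overlap cannot exceed either the pathway size or the list size.
    upper = min(n_pathway, n_selected)
    # Number of ways to draw the gene list from the background.
    total = comb(n_background, n_selected)
    # Sum the probability mass for overlaps at least as large as observed.
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 20,000 background genes, a 100-gene pathway,
    # 200 selected genes, 8 of which belong to the pathway.
    print(ora_p_value(20_000, 100, 200, 8))
    # The same overlap judged against a smaller, assumed background.
    print(ora_p_value(10_000, 100, 200, 8))
```

Running the two calls side by side shows how the same observed overlap yields a different p-value once the assumed background size changes, which is one small instance of how analysis settings can shift enrichment results.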