Ruth Alexandra Stoney, Jean-Marc Schwartz, David L Robertson, Goran Nenadic.
Abstract
BACKGROUND: The consolidation of pathway databases, such as KEGG, Reactome and ConsensusPathDB, has generated widespread biological interest; however, pathway redundancy impedes the use of these consolidated datasets. Attempts to reduce this redundancy have focused on visualizing pathway overlap or merging pathways, but the resulting pathways may be of heterogeneous sizes and cover multiple biological functions. Efforts have also been made to deal with redundancy in pathway data by consolidating enriched pathways into a number of clusters or concepts. We present an alternative approach, which generates pathway subsets capable of covering all of the genes present within either pathway databases or enrichment results, substantially reducing redundancy.
Keywords: Data redundancy; Gene enrichment analysis; Pathways; Set cover
Year: 2018 PMID: 30340461 PMCID: PMC6194563 DOI: 10.1186/s12859-018-2355-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Set cover. (a) A simple set of overlapping sets. (b) The red set with 8 uncovered elements is selected first. (c) The blue set with 3 elements is selected second. (d) The orange set then covers all the elements in the universe
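The selection order described in Fig. 1 matches the textbook greedy heuristic for set cover: repeatedly pick the set that covers the most still-uncovered elements. The following is a minimal sketch of that heuristic, not the authors' implementation; the function name `greedy_set_cover` and the toy data in `toy_pathways` are hypothetical and only loosely mirror the figure.

```python
def greedy_set_cover(pathways):
    """Return pathway names chosen by the greedy set cover heuristic.

    pathways: dict mapping a pathway name to a set of gene identifiers.
    """
    # The universe is every gene that appears in at least one pathway.
    uncovered = set().union(*pathways.values())
    cover = []
    while uncovered:
        # Pick the pathway covering the most still-uncovered genes.
        # Iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(pathways), key=lambda name: len(pathways[name] & uncovered))
        cover.append(best)
        uncovered -= pathways[best]
    return cover


# Hypothetical toy data: integers stand in for genes.
toy_pathways = {
    "red": {1, 2, 3, 4, 5, 6, 7, 8},
    "blue": {7, 8, 9, 10, 11},
    "orange": {11, 12},
    "green": {1, 2, 9},  # fully redundant once red and blue are chosen
}
toy_cover = greedy_set_cover(toy_pathways)
```

In this toy input, "green" is never selected because every element it contains is already covered by earlier picks, which is the kind of redundancy reduction the abstract describes.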
Fig. 2 Jaccard coefficient between pathway pairs in the cover set results produced by each algorithm
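For reference, the Jaccard coefficient of two gene sets is the size of their intersection divided by the size of their union. The sketch below computes it for every pathway pair in a cover; the helper names and the example gene sets are illustrative assumptions, not data from the study.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(pathways):
    """Map each unordered pathway pair (sorted by name) to its Jaccard coefficient."""
    return {
        (p, q): jaccard(pathways[p], pathways[q])
        for p, q in combinations(sorted(pathways), 2)
    }


# Hypothetical cover consisting of three small pathways.
example_cover = {
    "pathway_A": {"TP53", "MDM2", "ATM"},
    "pathway_B": {"TP53", "ATM", "CHEK2", "BRCA1"},
    "pathway_C": {"INS", "IRS1"},
}
scores = pairwise_jaccard(example_cover)
```

Here `pathway_A` and `pathway_B` share 2 of 5 distinct genes, so their coefficient is 0.4, while pairs with no shared genes score 0.0.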
Fig. 3 Redundancy in set cover outputs given different GC values
Fig. 4 Pathway sizes in cover set when GC is set to (a) 100%, (b) 99%, (c) 95% and (d) 90%. The boxes indicate the 25th and 75th percentiles and the whiskers indicate the 5th and 95th percentiles
Proportion of pathways from CPDB databases
| Database | Median size | CPDB % | Standard 100% | Standard 99% | Standard 95% | Standard 90% | Hitting 100% | Hitting 99% | Hitting 95% | Hitting 90% | Proportional 100% | Proportional 99% | Proportional 95% | Proportional 90% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BioCarta | 15.0 | 6.3 | 6.3 | 4.6 | 0.5 | 0.0 | 4.7 | 4.8 | 5.4 | 5.4 | 5.8 | 6.1 | 6.1 | 5.0 |
| EHMN | 32.5 | 1.6 | 3.2 | 3.4 | 2.6 | 1.0 | 2.1 | 2.3 | 1.8 | 1.6 | 1.6 | 1.4 | 0.9 | 0.9 |
| HumanCyc | 5.0 | 8.2 | 6.5 | 7.7 | 2.6 | 0.0 | 10.1 | 10.9 | 12.9 | 14.3 | 10.9 | 11.7 | 13.7 | 15.4 |
| INOH | 34.5 | 2.3 | 1.7 | 1.9 | 1.0 | 1.0 | 0.8 | 0.6 | 0.3 | 0.2 | 1.1 | 1.1 | 0.9 | 0.7 |
| KEGG | 65.0 | 7.2 | 29.0 | 30.5 | 37.6 | 40.4 | 15.8 | 15.0 | 13.5 | 13.4 | 12.2 | 9.9 | 8.3 | 7.1 |
| NetPath | 51.0 | 0.9 | 2.1 | 2.4 | 3.6 | 5.1 | 1.1 | 1.2 | 1.1 | 1.0 | 1.0 | 0.9 | 0.6 | 0.2 |
| PharmGKB | 13.0 | 2.8 | 3.1 | 2.9 | 0.5 | 0.0 | 2.0 | 2.1 | 2.4 | 2.3 | 2.1 | 2.2 | 2.1 | 1.7 |
| PID | 35.0 | 5.2 | 15.6 | 13.9 | 10.3 | 6.1 | 9.5 | 9.8 | 9.4 | 8.5 | 8.2 | 8.3 | 6.4 | 4.6 |
| Reactome | 17.0 | 39.6 | 4.2 | 5.3 | 10.8 | 21.2 | 36.1 | 35.1 | 34.7 | 35.3 | 39.4 | 40.9 | 45.1 | 48.8 |
| Signalink | 32.0 | 0.4 | 1.0 | 1.2 | 1.0 | 0.0 | 0.6 | 0.7 | 0.7 | 0.7 | 0.7 | 0.7 | 0.7 | 0.8 |
| SMPDB | 11.0 | 16.7 | 1.7 | 1.4 | 0.5 | 0.0 | 1.6 | 1.5 | 1.4 | 1.2 | 2.8 | 3.0 | 2.9 | 3.2 |
| Wikipathways | 26.0 | 8.8 | 25.6 | 24.9 | 28.9 | 25.3 | 15.6 | 16.0 | 16.2 | 16.2 | 14.2 | 13.7 | 12.5 | 11.7 |
Median size is the median size of the pathways from each database in the CPDB dataset. CPDB % is the proportion of pathways in the unaltered dataset that came from each database. The remaining columns give the proportion of pathways in the set cover generated by the standard set cover algorithm, the hitting set cover algorithm and the proportional set cover algorithm. Different results are obtained by altering the proportion of the gene set covered, indicated by the percentage given for each column (100%, 99%, 95% or 90%).
Fig. 5 Pathway redundancy heat maps. (a) Pathway overlap for top ten enriched pathways. (b) Pathway overlap for top ten enriched pathways after application of set cover. The values represent asymmetric overlap, i.e. for each pathway shown on the left axis, values represent the proportion of genes that are also included in the pathway shown on the bottom axis
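The asymmetric overlap defined in the Fig. 5 caption can be sketched as follows: for a row pathway and a column pathway, it is the fraction of the row pathway's genes that also appear in the column pathway. The function names and the two example pathways are hypothetical, and treating an empty row pathway as 0.0 is an assumption made here for convenience.

```python
def asymmetric_overlap(row_genes, col_genes):
    """Proportion of the row pathway's genes that are also in the column pathway."""
    row_genes, col_genes = set(row_genes), set(col_genes)
    if not row_genes:
        return 0.0  # assumption: an empty pathway has no overlap
    return len(row_genes & col_genes) / len(row_genes)


def overlap_matrix(pathways):
    """Nested dict matrix[row][col] of asymmetric overlaps, as in a heat map."""
    names = sorted(pathways)
    return {
        r: {c: asymmetric_overlap(pathways[r], pathways[c]) for c in names}
        for r in names
    }


# Hypothetical pathways: "small" is fully contained in "large".
example = {
    "large": {"G1", "G2", "G3", "G4"},
    "small": {"G3", "G4"},
}
matrix = overlap_matrix(example)
```

Because the measure is normalised by the row pathway only, the matrix is not symmetric: half of "large" lies in "small", but all of "small" lies in "large". This makes nested pathways visible in a way a symmetric measure such as the Jaccard coefficient does not.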