Joanna Zyla, Michal Marczyk, January Weiner, Joanna Polanska.
Abstract
BACKGROUND: Many methods exist for describing the complex relationship between changes in gene expression in molecular pathways or gene ontologies under different experimental conditions. Among them, Gene Set Enrichment Analysis (GSEA) appears to be one of the most commonly used (over 10,000 citations). An important parameter that can affect the final result is the choice of metric for ranking genes. Applying a default ranking metric may lead to poor results.
Keywords: Functional genomics; GSEA; Pathway analysis; Ranking metrics
Year: 2017 PMID: 28499413 PMCID: PMC5427619 DOI: 10.1186/s12859-017-1674-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
General information about the microarray data sets used
| GEO | Target KEGG ID | Disease/KEGG pathway name | Tissue | Sample Size (Control+Case) |
|---|---|---|---|---|
| GSE1145 | hsa:05414 | Dilated cardiomyopathy | Left Ventricle | 26 (11+15) |
| GSE14924_CD4 | hsa:05221 | Acute myeloid leukemia | CD4 T cells | 20 (10+10) |
| GSE14924_CD8 | hsa:05221 | Acute myeloid leukemia | CD8 T cells | 21 (11+10) |
| GSE16759 | hsa:05010 | Alzheimer’s disease | Parietal lobe | 8 (4+4) |
| GSE24739_G0 | hsa:05220 | Chronic myeloid leukemia | Peripheral blood | 12 (4+8) |
| GSE24739_G1 | hsa:05220 | Chronic myeloid leukemia | Peripheral blood | 12 (4+8) |
| GSE32676 | hsa:05212 | Pancreatic cancer | Pancreas | 32 (7+25) |
| GSE4183 | hsa:05210 | Colorectal cancer | Colon | 23 (8+15) |
| GSE1297 | hsa:05010 | Alzheimer’s disease | Hippocampal CA1 | 16 (9+7) |
| GSE14762 | hsa:05211 | Renal Cancer | Kidney | 21 (12+9) |
| GSE19188 | hsa:05223 | Non-small cell lung cancer | Lung | 153 (62+91) |
| GSE19728 | hsa:05214 | Glioma | Brain | 21 (4+17) |
| GSE20153 | hsa:05012 | Parkinson’s disease | Lymphoblasts | 16 (8+8) |
| GSE20291 | hsa:05012 | Parkinson’s disease | Postmortem brain putamen | 33 (19+14) |
| GSE21354 | hsa:05214 | Glioma | Brain, Spine | 17 (4+13) |
| GSE3585 | hsa:05414 | Dilated cardiomyopathy | Subendocardial left ventricle | 12 (5+7) |
| GSE4107 | hsa:05210 | Colorectal cancer | Mucosa | 22 (10+12) |
| GSE5281_EC | hsa:05010 | Alzheimer’s disease | Entorhinal cortex | 21 (12+9) |
| GSE5281_HIP | hsa:05010 | Alzheimer’s disease | Hippocampus | 23 (13+10) |
| GSE5281_VCX | hsa:05010 | Alzheimer’s disease | Primary visual cortex | 31 (12+19) |
| GSE781 | hsa:05211 | Renal Cancer | Kidney | 17 (5+12) |
| GSE8762 | hsa:05016 | Huntington’s disease | Lymphocytes | 22 (10+12) |
| GSE9348 | hsa:05210 | Colorectal cancer | Colon | 82 (12+70) |
| GSE9476 | hsa:05221 | Acute myeloid leukemia | Peripheral Blood | 63 (37+26) |
| GSE6344 | hsa:05211 | Renal Cancer | Kidney | 20 (9+11) |
| GSE15641 | hsa:05211 | Renal Cancer | Kidney | 55 (23+32) |
| GSE14994 | hsa:05211 | Renal Cancer | Kidney | 30 (8+22) |
| GSE11024 | hsa:05211 | Renal Cancer | Kidney | 22 (12+10) |
Description of ranking metrics sorted from the most parametric, through non-parametric to data mining methods
| Metrics | Description | Comments | Ref. |
|---|---|---|---|
| T-test | | | [ |
| MWT | | and absolute value | [ |
| MSD | | | [ |
| S2N | | and absolute value | [ |
| WAD | | and absolute value | [ |
| Difference | | | [ |
| Ratio | | and log2 | [ |
| FCROS | | k - pairwise comparison; FC - fold change; N - no. of genes | [ |
| SoR | | | [ |
| BWS | | | [ |
| ReliefF | | and tied rank | [ |
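A few of the listed ranking metrics can be sketched in code. The definitions below are the commonly used GSEA-style forms (difference of class means, ratio of class means, signal-to-noise as the mean difference over the sum of standard deviations, and a Welch-type t statistic). They are assumptions for illustration and may differ in detail from the formulas used in the paper. The function name `ranking_metrics` and its arguments are hypothetical.

```python
import numpy as np

def ranking_metrics(case, control):
    """Compute several per-gene ranking metrics.

    case, control: arrays of shape (n_genes, n_samples) holding
    non-log, strictly positive expression values (assumed here so
    that the ratio and its log2 are well defined).
    """
    case = np.asarray(case, dtype=float)
    control = np.asarray(control, dtype=float)

    # Per-gene class means and sample standard deviations
    m1, m0 = case.mean(axis=1), control.mean(axis=1)
    s1 = case.std(axis=1, ddof=1)
    s0 = control.std(axis=1, ddof=1)
    n1, n0 = case.shape[1], control.shape[1]

    ratio = m1 / m0
    return {
        "difference": m1 - m0,
        "ratio": ratio,
        "log2_ratio": np.log2(ratio),
        # Signal-to-noise: mean difference over summed standard deviations
        "s2n": (m1 - m0) / (s1 + s0),
        # Welch-type t statistic (unequal variances assumed)
        "t_test": (m1 - m0) / np.sqrt(s1**2 / n1 + s0**2 / n0),
    }

# Toy example: 2 genes, 2 samples per class
control = np.array([[1.0, 3.0], [2.0, 4.0]])
case = np.array([[4.0, 6.0], [2.0, 4.0]])
metrics = ranking_metrics(case, control)
abs_s2n = np.abs(metrics["s2n"])  # the "absolute value" variant
```

Genes would then be sorted by the chosen metric before running the enrichment step. The absolute-value variants ignore the direction of regulation.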
Fig. 1 Boxplots of surrogate sensitivity and FPR of gene set analysis. Panel a shows the distribution of target pathway enrichment p-values for each metric on a logarithmic scale (the lower, the better). Panel b shows the results of FPR estimation, where the red line marks the expected outcome (the closer to 5%, the better)
Results of overall sensitivity, false positive rate and average evaluation time for all ranking metrics
| Rank metric | Overall sensitivity | Overall false positive rate | Average evaluation time [s] |
|---|---|---|---|
| T-test | 0.066 | 8.162 | 118.363 |
| MWT | 0.928 | 19.634 | 191.944 |
| \|MWT\| | 0.998 | 3.665 | 191.944 |
| MSD | 0.559 | 0.008 | 117.627 |
| S2N | 0.926 | 17.932 | 115.090 |
| \|S2N\| | 0.981 | 2.542 | 115.090 |
| WAD | 0.992 | 23.482 | 112.534 |
| \|WAD\| | 0.994 | 23.971 | 112.534 |
| Difference | 0.985 | 25.212 | 111.920 |
| Ratio | 0.997 | 26.429 | 121.228 |
| log2(Ratio) | 0.824 | 23.337 | 120.820 |
| FCROS | 0.758 | 19.228 | 324.413 |
| SoR | 0.756 | 23.087 | 264.746 |
| BWS | 0.900 | 2.696 | 289.215 |
| ReliefF | 0.840 | 6.394 | 912.471 |
| ReliefF ranked | 0.548 | 2.852 | 912.471 |
Overall sensitivity is defined as 1 minus the estimate obtained with Storey’s method (the higher, the better). Overall false positive rate is defined as the absolute value of the difference between the observed and expected false positive rates (the lower, the better)
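The two definitions above can be illustrated with a minimal sketch. It assumes that the estimator from Storey’s method is the proportion of true null hypotheses evaluated at a single fixed tuning parameter `lam` (the paper may use a smoothed or bootstrap variant), that the expected false positive rate is 5%, and that FPR is reported in percentage points. All function names are hypothetical.

```python
import numpy as np

def storey_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of true nulls at a fixed lambda."""
    p = np.asarray(pvalues, dtype=float)
    # Fraction of p-values above lambda, rescaled by the width of (lambda, 1]
    pi0 = np.mean(p > lam) / (1.0 - lam)
    return min(pi0, 1.0)  # a proportion cannot exceed 1

def overall_sensitivity(pvalues, lam=0.5):
    """1 minus the Storey-type estimate (the higher, the better)."""
    return 1.0 - storey_pi0(pvalues, lam)

def overall_fpr(null_pvalues, alpha=0.05):
    """|observed - expected| FPR in percentage points (the lower, the better)."""
    p = np.asarray(null_pvalues, dtype=float)
    observed = 100.0 * np.mean(p < alpha)
    return abs(observed - 100.0 * alpha)

# Toy p-values for illustration only
pvals = [0.01, 0.02, 0.03, 0.04, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9]
sens = overall_sensitivity(pvals)
fpr = overall_fpr(pvals)
```

In practice the sensitivity would be computed from target pathway p-values and the FPR from p-values obtained under a design with no true signal. The toy vector is reused here only to keep the example short.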
Fig. 2 Results of k-means cluster analysis based on three performance criteria. Green indicates good performance, yellow indicates medium performance and red indicates poor performance
Fig. 3 Results of k-means cluster analysis based on two performance criteria. The best metrics are those for which the FPR estimate is closest to 0 and the sensitivity estimate is closest to 1
Fig. 4 Robustness of ranking metrics to sample size. Panel a shows the surrogate sensitivity assessment of the four best metrics for different sample sizes. Panel b shows FPR estimates for the tested sample sizes
Fig. 5 Results of detecting significant gene sets across various thresholds. Panel a shows the percentage of significantly enriched pathways. Solid lines represent the average value across the analysed data sets, whereas dashed lines represent its confidence intervals. Panel b shows the percentage of significantly enriched pathways in the experimental design dedicated to FPR evaluation. The red dashed line represents the expected outcome