Ashwini Jeggari, Andrey Alexeyenko.
Abstract
BACKGROUND: The statistical evaluation of pathway enrichment, i.e. of gene profiles' confluence to the pathway level, allows exploring molecular landscapes using functionally annotated gene sets. However, pathway scores can also be used as predictive features in machine learning. That requires, firstly, increasing statistical power and biological relevance via a network enrichment analysis (NEA) and, secondly, a fast and convenient procedure for rendering the original data into a space of pathway scores. However, previous implementations of NEA involved multiple runs of network randomization and were therefore slow.
Keywords: Enrichment; Network analysis; Network benchmark; R package
Year: 2017 PMID: 28361684 PMCID: PMC5374688 DOI: 10.1186/s12859-017-1534-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. Analysis flow in NEArender. The original matrix of 'omics' (mutation, methylation, expression etc.) data described a limited number of samples (patients etc.) with a much larger number of gene feature rows. At the first, preparatory step each sample was described via a characteristic sample-specific altered gene set (AGS). In parallel, a collection of functional gene sets (FGS), each sharing certain functional annotations within the set, was downloaded or otherwise prepared. A global gene/protein network (NET) was also provided (possibly selected from a number of alternatives based on benchmark results). In the course of network enrichment analysis (NEA) each AGS received as many NEA scores as there were FGSs, i.e. obtained coordinates in the multidimensional FGS space. This created an output matrix with the same number of sample columns but many fewer rows
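The rendering step described in Fig. 1 can be illustrated with a minimal Python sketch. It is not the NEArender implementation (which is an R package) and it does not compute the NEA statistic: as a stand-in score it simply counts the network links between each AGS and each FGS. The function names `count_links` and `render_scores` and the toy gene, sample, and pathway labels are hypothetical, chosen only for illustration.

```python
def count_links(ags, fgs, edges):
    """Count network edges that join a gene in `ags` to a gene in `fgs`."""
    ags, fgs = set(ags), set(fgs)
    n = 0
    for a, b in edges:
        # An undirected edge counts if its ends fall into the two sets.
        if (a in ags and b in fgs) or (b in ags and a in fgs):
            n += 1
    return n


def render_scores(ags_by_sample, fgs_by_name, edges):
    """Return {fgs_name: {sample: link_count}}: one row per FGS, one column per sample."""
    return {
        fgs_name: {
            sample: count_links(ags, fgs, edges)
            for sample, ags in ags_by_sample.items()
        }
        for fgs_name, fgs in fgs_by_name.items()
    }


# Toy inputs (assumed for illustration only).
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]   # NET
ags_by_sample = {"s1": {"A", "B"}, "s2": {"D"}}            # one AGS per sample
fgs_by_name = {"path1": {"C"}, "path2": {"E"}}             # FGS collection

scores = render_scores(ags_by_sample, fgs_by_name, edges)
for fgs_name, row in scores.items():
    print(fgs_name, row)
```

The output keeps one column per sample but has only as many rows as there are FGSs, which is the dimensionality reduction the caption describes. A real analysis would replace the raw link count with a properly normalized enrichment score.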
Fig. 2. Comparative sensitivity and sources of bias in randomization-based versus binomial calculation of network enrichment. Network enrichment between all vs. all 330 gene sets was analyzed with both NRZ and CSB methods. P-value distributions were compared using Q-Q plots (columns 1 and 2) and scatter plots of log(p) values (column 3). Q-Q plots in column 1 display both the total distributions (black lines), i.e. regardless of GS size, and distribution fractions that correspond to smaller GS (N_AGS + N_FGS, color lines). The Q-Q plots in column 2 are insets of the black, un-stratified Q-Q plots of column 1. Identity lines (x = y) are plotted in dotted grey and dotted red in the Q-Q plots and scatter plots, respectively. Analogous plots with regard to other factors biasing NRZ p-values (C_AGS + C_FGS and N_edges) are provided in Additional file 1: Figure S1 (note that plot columns 2 and 3 are the same as in the present figure)
Fig. 3. Agreement between biological replicates in alternative approaches to differential expression analysis. a, c, e: Examples of Spearman rank correlations between DE analyses without replicates on gingival epithelial versus tenocyte cells using samples from two different donor pairs (gingival epithelial: #4 and #5; tenocyte: #2 and #3). Since p-values in the non-replicated DE analyses were not available, the plot A (RAW) represents raw fold change values for 16620 genes, while plots C and E represent p-values for 330 FGSs. In C and E, the vertical and horizontal brown dotted lines delineate p-values significant after the Bonferroni correction. Dashed grey line: the linear regression fit of Y on X (the R values shown in the corners were calculated using the rank formula and are thus independent of the fits). NEA and GSEA values were obtained for AGSs representing DE genes in each analysis that satisfied the criterion |log2(fold change)| > 2, i.e. a 4-fold change in either direction. These example rank R values from A, C, and E are plotted in B, D, and F, respectively, as big orange dots. Thus in B, D, and F, the Spearman coefficients from A, C, and E (top left corner) as well as from all other pairwise comparisons are plotted as a function of relative difference strength between respective transcriptomes (X axis). As an example, according to the results of fully powered DE analysis for plots A, C, and E, 13.8% of genes were DE with adjusted p-value < 0.05. Hence the value 0.138 is used as X-coordinate for the orange dots. The grey dotted linear regression line and the Spearman rank R value quantify the relations between X and Y. Black and colored points correspond to 1) correlation values of non-replicated DE analyses with each other and 2) correlations where one of the two analyses was replicated, respectively. The boxplots G and H summarize results across all cell types and AGS versions for non-replicated vs. half-replicated options (1) and (2).
All pairwise contrasts of GSEA and RAW against NEA were significant with p(H0) < 0.001. In Additional file 1: Figure S3, six plots for NEA and GSEA represent results of using all the six alternative AGS versions