Literature DB >> 28334094

FUNNEL-GSEA: FUNctioNal ELastic-net regression in time-course gene set enrichment analysis.

Yun Zhang¹, David J Topham², Juilee Thakar^1,2, Xing Qiu¹.

Abstract

MOTIVATION: Gene set enrichment analyses (GSEAs) are widely used in genomic research to identify underlying biological mechanisms (defined by the gene sets), such as Gene Ontology terms and molecular pathways. There are two caveats in the currently available methods: (i) they are typically designed for group comparisons or regression analyses, which do not utilize temporal information efficiently in time-series of transcriptomics measurements; and (ii) genes overlapping in multiple molecular pathways are considered multiple times in hypothesis testing.
RESULTS: We propose an inferential framework for GSEA based on functional data analysis, which utilizes the temporal information based on functional principal component analysis, and disentangles the effects of overlapping genes by a functional extension of the elastic-net regression. Furthermore, the hypothesis testing for the gene sets is performed by an extension of Mann-Whitney U test which is based on weighted rank sums computed from correlated observations. By using both simulated datasets and a large-scale time-course gene expression data on human influenza infection, we demonstrate that our method has uniformly better receiver operating characteristic curves, and identifies more pathways relevant to immune-response to human influenza infection than the competing approaches.
AVAILABILITY AND IMPLEMENTATION: The methods are implemented in R package FUNNEL, freely and publicly available at: https://github.com/yunzhang813/FUNNEL-GSEA-R-Package . CONTACT: xing_qiu@urmc.rochester.edu or juilee_thakar@urmc.rochester.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical

Mesh：

Year: 2017 PMID： 28334094 PMCID： PMC5939227 DOI： 10.1093/bioinformatics/btx104

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Microarrays and RNA-seq have made simultaneous expression profiling of thousands of genes across several experimental/clinical conditions widely accessible. However, interpreting the profiles from such large numbers of genes remains a key challenge. An important conceptual advance in this area was the shift from a focus on differential expression of single genes to testing sets of biologically related genes (Mootha ; Subramanian ). Here gene sets are typically defined a priori to include genes that share some common biologically relevant properties (e.g. members of the same metabolic pathway, having a common biological function, presence of a binding motif etc.). In addition to the advantage in interpretability, another benefit of analyzing gene sets instead of individual genes is that small changes in gene expression are unlikely to be captured by conventional single-gene approaches, especially after correction for multiple testing (Mootha ). Due to these advantages, gene-set hypothesis testing has become a popular research area and many methods have been developed (Dinu ; Jiang and Gentleman, 2007; Kim and Volsky, 2005; Luo ; Oron ; Saxena ; Wu ; Wu and Smyth, 2012; Yaari ) in recent years to improve the original GSEA procedure (Subramanian ). For example, some recently developed gene set tests such as CAMERA (Wu and Smyth, 2012) and its extension (Yaari ) include adjustments for inter-gene correlation. Such adjustments are necessary because inter-gene correlation can increase the false discoveries of many differential expression tests (Qiu , 2013, 2014; Qiu and Yakovlev, 2006, 2007) and gene-set tests (Breslin ; Dørum ; Wu and Smyth, 2012) substantially and render the results highly variable. Another under-developed area is time-course gene set analyses. Although some inferential tools (Conesa ; Luan and Li, 2004; Park ; Sohn ; Storey ; Wu and Wu, 2013) are available for detecting temporally significant genes, only a handful existing methods are specifically designed for time-course gene set analyses (Hejblum ; Nueda ; Wang , 2009a,b; Zhang ) based on generic analytical tools such as linear mixed effect regression (Wang , 2009a,b), principal component analysis (PCA, Nueda ), and B-splines (Hejblum ). Increasingly, personalized approaches are applied to transcriptome analyses to study subject-specific responses. Unlike genetically identical mice, transcriptomic studies across human population have revealed that human subjects exhibit large variation in responses to biological conditions across subjects. Large between-subject variation can reduce the statistical power significantly in a standard cross-sectional study. This problem can be mitigated by incorporating subject-specific information (e.g. baseline measurements before intervention or infection in a longitudinal analysis (Thakar ; Tsang ). Moreover, to quantify the heterogeneity of subject-specific responses may be an important aspect of a study (Henn ; Wu and Wu, 2013). This consideration has led to the development of the single sample GSEA (Barbie ) that uses the differences in empirical cumulative distribution functions of gene expression ranks inside and outside the gene set to calculate sample-specific enrichment statistics. In this study, we propose a new method based on functional PCA (FPCA, Ramsay and Silverman, 2005). It can detect arbitrary non-constant trends in time-course data analysis and is superior to several competing omnibus tests based on B-splines (Sohn ; Storey ), mainly because eigen-functions selected by FPCA form an orthogonal functional basis that explains more L2-variation of the entire transcriptome than any other basis. Moreover, FPCA is applied to each subject separately to improve the ability to identify subject-specific variations. Population-based inference can then be made by aggregating P-values across subjects with a suitable meta-analysis tool such as Fisher’s combined probability test (Fisher, 1963). The availability of high-quality gene sets from publicly accessible databases, such as MSigDB (Liberzon ) and KEGG pathways (Kanehisa and Goto, 2000) is critical for the success and popularity of gene set tests. Because pathway definitions in the public repositories are typically curated from many studies therefore not context-specific, there is a remarkable overlap in these gene sets. For example, we use 186 CP:KEGG biological pathways provided from MSigDB database in this study. These pathways consist of 5267 unique genes, and 2278 (or 43.3%) of them belong to two or more pathways. Ignoring this overlap overweighs the importance of genes shared by multiple sets and increases the dependence of hypothesis tests, thus reduces statistical power and induce spurious type I error and instability of inferences at the gene set level (Gordon ; Qiu ; Qiu and Yakovlev, 2006, 2007). Moreover, we and others have studies context-specific activations of pathways (Hartmann ; Katanic ; Lee ; Segal ), which are more pertinent to various immune and stress responses in both human and yeast models. It will be difficult to design follow-up experiments if the significance of selected gene sets is largely driven by generic associations that are not specific to the biological conditions of interest. To address this issue, we developed a weighting method based on functional elastic-net regression to assess the functional similarities between a given gene and the sets to which it belongs. By design, we let the weights pertinent to one gene sum up to one; so that this gene is not over-counted in gene-set-level analyses. We also developed a generalized Mann-Whitney U (MWU) test for gene-set-level inferences, which incorporates both inter-gene correlation and the weights determined by the functional elastic-net regression. With the proposed method, we will be able to estimate condition-specific importance of genes in the pathways, which will facilitate future empirical investigations. Dubbed as FUNNEL-GSEA (FUNctioNal ELastic-net regression in Gene Set Enrichment Analysis) or simply FUNNEL, our method utilizes recent advances in functional data analysis and is the first method to directly account for the overlapping genes by decomposing them into fractions (weights) in gene-set-specific manner. In this study, we demonstrate that FUNNEL has better statistical power and uniformly better receiver operator characteristics (ROC) curves than two major competing methods, CAMERA (Wu and Smyth, 2012) and TcGSA (Hejblum ). The original CAMERA parametric test is designed for detecting linear trend only; with appropriately selected summary statistic, CAMERA can be extended for non-linear trends using its non-parametric test. Furthermore, we apply FUNNEL to temporal gene expression data collected from human subjects challenged with live influenza viruses to show increased sensitivity and identification of pathways more informative in describing the differences of two phenotypic groups (symptomatic and asymptomatic subjects) at the molecular level.

2 Materials and methods

Our method, FUNNEL-GSEA, consists of three components: (i) A gene-level summary statistic based on FPCA test (Ramsay and Silverman, 2005; Wu and Wu, 2013). (ii) A weighting method to decompose overlapping genes based on functional elastic-net regression. (iii) A generalized MWU test that incorporates both weights and inter-gene correlation to test significant gene sets. These components are described in the following subsections.

2.1 A gene-level summary statistic based on FPCA

Let be the pre-processed (i.e. normalized and log-transformed) expression level for the th gene at the th sampling time point. The observed data can be modeled as where is unknown gene-specific function of time, and is random noise. After subtracting the mean expression over time for each gene, we apply FPCA across all the centered expression values. The estimated per-gene expression curve is represented as where is the temporal sample mean expression; is the th eigen-function; and is the functional principal component (FPC) score that quantifies how much can be explained by . Empirical evidences show that the first three eigen-functions (L = 3) explain about 86% of total variance of the real data (see Supplementary Material Section S1), and thus can represent the overall expression pattern of a given gene set. This high-level of variance explained by just a few eigen-genes is very common in FPCA analyses of time-course gene expression data (Qiu ; Wu ; Wu and Wu, 2013). As a comparison, we performed standard PCA in the time direction on the same expression data. Top 3 principal components only explain about 53% of total variance and it would require at least 8 principal components to explain about 86% of variance (see Supplementary Material Section S1). The main reason that FPCA is much more efficient than PCA in explaining data variation is that FPCA uses roughness penalty to achieve smoothness of temporal functions; in doing so it ‘borrows information’ across time points and reduces a large proportion of spurious variation pertaining to the i.i.d. measurement error. For time-course gene expression data, a temporally differentially expressed gene can be defined as a gene with significant non-constant expression pattern across time points. In other words, we want to test the following hypotheses Under , is estimated by a function with constant value (the sample mean expression for the th gene); under , defined in Equation (1) is used instead. We use the following functional -statistic to summarize the information contained by each gene over time: where and are the residual sum of squares under the null and alternative hypotheses, respectively. This summary statistic can be viewed as a ‘signal-to-noise’ ratio for functional data. The larger is, the more significant the th gene is.

2.2 Decomposing overlapping genes based on functional elastic-net regression

We use the following concurrent functional linear model to decompose an overlapping gene between gene-sets Here is the set of gene sets where the th gene belong; , , is the linear coefficient w.r.t. , the th eigen-function obtained from performing FPCA on the th gene set. represents the temporal signal of the th gene attributed to the th gene set; and is the noise function that cannot be explained by any gene set. We denote the set of eigen-functions as a vector of functions ; the set of linear coefficients in Equation (2) as a vector, which can be estimated by the following optimization problem Here is the LASSO (Tibshirani, 1996) penalty coefficient and is the ridge (L2) penalty coefficient. We need LASSO penalty because sparcity in enhances the biological interpretability. We also need ridge penalty to account for possible collinearity problems because some genes are shared by many gene sets, which implies that a large number of covariates (eigen-functions) may be used in regression. Besides, although eigen-functions pertain to one gene set are independent by construction, there may be high correlation between certain eigen-functions estimated from different gene sets. In this case, adding ridge penalty can make the parameter estimation more stable. By using the terminology from multivariate regression, we call the above optimization problem as functional elastic-net regression (Zou and Hastie, 2005). Most currently available penalized functional linear regression methods (Goldsmith ) focus on using Tikhonov regularization (semi-positive-definite penalty) to achieve smoothness and computational stability of functional linear regression. Model selection methods such as LASSO and group SCAD (Fan and Li, 2001; Wang ) regularization were recently used in functional linear regression but most of them (Collazos ; Gertheiss ; James ; Lee and Park, 2012; Matsui and Konishi, 2011) are developed for standard functional regression model (functional covariates and scalar responses). Model selection methods for historical functional linear models (of which the concurrent model is a special case) were studied in (Harezlak ; Matsui ). These methods involve computational expensive fitting techniques that are not necessary for concurrent functional regression model. In this study, we took a different approach based on an equivalence relationship between the penalized concurrent functional regression and a standard multivariate regression. This approach is computationally efficient and highly flexible. Technical details of this approach can be found in Supplementary Material Section S2. The selection of the penalty coefficients can be found in Supplementary Material Section S6. Once is estimated, we define the estimated weight for the th gene in the th gene set to be We assign if gene belongs to the th gene set only. Due to the use of LASSO penalty, in some cases both the numerator and denominator may be zero, in which case we assign . The weighting vector of the th pathway is denoted by , where is the set of genes in this gene set.

2.3 Weighted MWU test with correlation

The MWU test is a rank-based non-parametric test that can be used in a competitive GSEA to test whether the median of gene-level summary statistics () sampled from the testing gene set is significantly greater than the median of sampled from the rest of the genome (called the background genes). Unlike the setting of classical MWU test in which all observations are assumed to be independent, genes are known to be correlated with each other, especially within one biological pathway. Consequently, the classical MWU test must be adapted to accommodate with such correlation. As a special case, CAMERA assumes that genes in the testing gene set share a common pairwise correlation (interchangeable correlation structure) and the background genes remain independent. In this study, we further extended the modified MWU test used in CAMERA to allow the use of weights, which reflect the empirical membership of the overlapping genes assigned to the test gene set. Suppose we want to test the th gene set which contains member genes. We denote the number of background genes as . We define the weighted MWU statistic for this gene set as It is worth noting that applying weights to the Mann-Whitney style of ranks is not equivalent to applying weights to the Wilcoxon style of ranks, which are the ranks of observations in both samples. Under the assumption of interchangeability, it is easy to show that under Here can be considered as the effective sample size of the testing gene set. We also compute the variance of under as where is the inter-gene correlation; and are two constants depending on weights only. More details can be found in Supplementary Material Section S3. Because is signless (the larger , the more significant), we apply a one-sided -test based on the following standardized weighted rank sum statistic Similar to CAMERA, we set the degrees of freedom of the -distribution to , to reflect the precision with which is estimated. Our test reduces to CAMERA’s MWU test when ; which further reduces to the usual MWU test when . As a summary, we illustrate the overall structure of FUNNEL-GSEA in Figure 1.

Fig. 1.

An illustration of the FUNNEL-GSEA analysis pipeline

2.4 Human influenza infection data

Gene expression data of human influenza infection (Huang ; Woods ) were downloaded from Gene Expression Omnibus (Edgar ) series GSE52428. A total of 17 (9 symptomatic and 8 asymptomatic) subjects infected by influenza A H3N2/Wisconsin virus were studied. Whole transcriptome expression data had been sampled at 16 time points from one day before the infection till 5 days after the infection. Gene expressions were log2-transformed and filtered based on inter-quantile range (IQR ≥ 0.3). After non-specific filtering, ∼10 000 genes were reported for each subject. These gene expressions were standardized by the z-score transformation.

2.5 Simulations

Simulated data were generated by a linear mixed effect model with 16 time points (as in the real data) and 5000 genes with two (small and large) noise variance parameters. These genes were grouped into 90 synthesized pathways such that 3000 of them belong to a single pathway and the rest 2000 are shared by more than one pathway. 18 out of these 90 pathways were assigned with true signals generated from linear combinations of three designed signal patterns, to represent linear trend, sinusoid trend, and late (linear) response, respectively. We simulated cases with small and large variance. The simulation was repeated for 20 times for each case. Technical details of these simulated data, as well as some additional simulations based on biological signals and correlation structures more general than the interchangeable structure can be found in Supplementary Material Section S5.

3 Results

3.1 Large proportion of genes are shared by multiple gene sets

Functionally related genes in specific gene sets or pathways are static instances derived frequently by curation and recently by meta-analysis (Ruparelia ; Tan ). The overlap between these gene-sets is inevitable given modular topology of biological response. For example, NFkB related genes can be induced upon stimulation by several cytokines. However, exact instance of activation of NFkB regulated genes in a specific infection might not be derived by all the cytokines that can activate those genes. Specifically, we use 186 pathways from MSigDB (category CP:KEGG), which have 5267 unique genes. Among them, 2278 (or 43.3%) genes belong to two or more pathways; the two most-overlapped genes (MAPK1 and MAPK3) belong to 46 pathways (see Fig. 2 for more details).

Fig. 2.

Distribution of overlapping genes among KEGG pathways curated by MSigDB. Each bin in the histogram represents one row in Table (a). In total, 5267 unique genes pooled from 186 CP:KEGG pathways are used in this illustration. Among them, 2278 (or 43.3%) genes belong to two or more pathways; the two most-overlapped genes (MAPK1 and MAPK3) belong in 46 pathways

3.2 Estimating the empirical membership (weights) of overlapping genes

Given a gene associated with multiple gene sets, its context-specific activation could be indeed mediated by all the gene sets, only one of the gene sets, or none of the gene sets. Virtually all currently available gene set tests simply assume that the overlapping genes are activated by all gene sets to which they belong. This practice is equivalent to always assign for estimated weights and is henceforth called the naïve method. In order to evaluate weight assignment, we developed simulated data sets where the real membership of each gene to its gene-sets () is known using a linear mixed effect model (see Section 2 and Supplementary Material Section S5). The mean squared error (MSE) of the empirical membership (weights) estimated by FUNNEL averaged across 20 replications was 0.13 for the small variance simulations and 0.18 for the large variance simulations. Here for each simulation is defined as where , and is averaged over 20 replications. To put the accuracy of weight estimation in context, we took a closer look at genes that are shared by two pathways and compare the MSEs associated with FUNNEL with the naïve method. For the clarity of discussion, we consider binary weights (provide MSE at binary level) and dichotomize the estimated weights to be zero if and one if . The results for the small variance case are listed as follows (see Supplementary Tables S1 and S2 for the large variance case). In summary, FUNNEL was able to estimate the true weights better than the naïve method in all cases. Case I. On average (over 20 repetitions), 495.9 genes belong to two insignificant gene sets that do not carry any true temporal signals (true weight = (0,0)). FUNNEL correctly assigned zero weights 82% of times (MSEFUNNEL = 0.18 and MSEnaïve = 1). Case II. On average, 267.75 genes belong to one significant and one insignificant pathways (true weights = (0,1) or (1,0)). FUNNEL assigned the correct weights 83% of times (MSEFUNNEL = 0.17 and MSEnaïve = 0.5). Case III. On average, 13.35 genes belong to two significant gene sets. Because the true weight are in continuous scale, we only report the MSE for the continuous estimates for this case: MSEFUNNEL = 0.20 and MSEnaïve = 0.26.

3.3 Performance of gene set tests in simulation studies

Once weights are estimated, we apply the weighted MWU test with correlation (see Section 2) to test the significance of 90 synthesized pathways. Table 1 summarizes the statistical power and type I error for FUNNEL and CAMERA for both small and large variance cases. We find that FUNNEL always has better statistical power than CAMERA. Next, we conduct receiver operating characteristic (ROC) analyses for both methods. Figure 3 shows that the ROC curves of FUNNEL dominate that of CAMERA uniformly. In both small and large variance cases, the area under the ROC curve (AUC) of FUNNEL is larger than that of CAMERA, and such differences are significant based on a non-parametric AUC test (DeLong ).

Table 1.

Mean (STD) of statistical power and type I error at 5% significance level for FUNNEL and CAMERA in 20 replicates of simulation

		Small variance	Large variance
FUNNEL	Power	0.825 (0.049)	0.714 (0.049)
	Type I error	0.001 (0.003)	0.010 (0.013)
CAMERA	Power	0.703 (0.049)	0.508 (0.101)
	Type I error	0.001 (0.003)	0.001 (0.004)

Fig. 3.

ROC curves of FUNNEL (black, dashed) and CAMERA (grey, solid) for different signal to noise levels. ROC curves were plotted for small (left) and large (right) variances in the noise level of the simulated data (see Section 2 for the details). ROC curve of FUNNEL dominates that of CAMERA in both cases. The differences of AUC in both cases are highly significant based on the AUC test Mean (STD) of statistical power and type I error at 5% significance level for FUNNEL and CAMERA in 20 replicates of simulation Next, we compare the performance of TcGSA on the same set of simulation data. Unlike FUNNEL or CAMERA, TcGSA is a self-contained gene set test; so the comparison is only exploratory. Out of the 90 synthesized pathways, TcGSA identifies all of them being significant, irrespective of whether the pathways have true signal or not. The potential reasons for such poor performance may include technical failures [such as convergence problem in the likelihood model and size limitation of gene set, which is documented in (Hejblum )], and the ignorance of overlapping among pathways.

3.4 Analyses of time-course H3N2 infection data

The time-course gene expression data used in our study were collected from 9 symptomatic and 8 asymptomatic subjects infected by influenza A H3N2/Wisconsin virus (see Section 2). We apply FUNNEL and CAMERA to test the significance of 186 KEGG-derived pathways provided by MSigDB for each subject. We use functional -statistic as the gene-level summary statistic for both procedures. For CAMERA, gene-set-level inference is made by the Wilcoxon rank sum test with adjustment for correlation. We use the same correlation estimates and the degrees of freedom in both FUNNEL and CAMERA (see Supplementary Material Section S4). The resulting subject-level -values are combined by Fisher’s combined probability test (Fisher, 1963) for each (symptomatic and asymptomatic) group. After applying Bonferroni procedure to control for the familywise error rate (FWER) at 0.05 level, FUNNEL is able to identify 32 significant pathways from the symptomatic group (n = 9) which include pathogen recognition receptor signaling pathways such as the Cytosolic DNA sensing, NOD-like receptor signaling, RIG-I-like receptor signaling, and Toll-like receptor signaling pathways. Adaptive immune response related pathways such as the B cell receptor signaling and T cell receptor signaling pathways are also significant. All of the above pathways are significant in 6 or more (≥67%) number of symptomatic subjects. For the asymptomatic group (n = 8), FUNNEL identifies 22 significant pathways, including the B cell receptor signaling, Antigen processing and presentation, and Ribosome pathways. Fewer subjects (≤50%) in the asymptomatic group show activation of the above pathways. We present a heatmap of -log10-transformed -values for the 32 significant pathways selected from the symptomatic subjects in Figure 4. In contrast, CAMERA only identifies four pathways with very general biological functions (the Glutathione metabolism, Ribosome, Proteasome and Primary immunodeficiency pathways) for the symptomatic group and seven (the Oxidative phosphorylation, Ribosome, Spliceosome, Proteasome, Protein export, B cell receptor signaling and Parkinsons disease pathways) for the asymptomatic group. We also list top 30 most significant (ranked by adjusted P-values) pathways selected by CAMERA for the symptomatic group in Supplementary Material Section S7 for a more holistic comparison. In conclusion, FUNNEL has improved sensitivity to identify relevant pathways than CAMERA.

Fig. 4.

A Heatmap of -log10-transformed P-values for all 32 significant CP:KEGG pathways selected by FUNNEL (ordered by adjusted Fisher’s P-values for the symptomatic group). The P-values computed from individual subjects were combined by Fisher’s combined probability test in each group. Bonferroni procedure for multiple testing adjustment was applied to the combined Fisher’s P-values to control for FWER. Many of these pathways listed here, such as the B-, T-cell receptor signaling, NOD-like signaling, RIG-like receptor signaling, Toll-like receptor signaling, Chemokine signaling, JAK-STAT signaling, FC-Epsilon-RI signaling, MTOR signaling pathways are well documented immune signaling pathways that send signals that lead to the activation of various cell-specific immune activities (Color version of this figure is available at Bioinformatics online.)

3.5 The utility of estimated empirical memberships

We illustrate the utility of the empirical memberships (weights) estimated from penalized functional linear regression in Figure 5. Shown in Figure 5A are the color-coded empirical memberships of top 10 overlapping genes with the largest average weights in the Complement and coagulation cascades pathway, estimated from nine symptomatic subjects. We can see that C3AR1 is assigned to Complement and coagulation cascades pathway almost exclusively for all subjects. According to MSigDB, C3AR1 can potentially be assigned to two pathways: the Neuroactive ligand-receptor interaction and Complement and coagulation cascades pathway. We found that the temporal pattern of C3AR1 resembles the Complement and coagulation cascades pathway most closely (Fig. 5B). This empirical evidence suggests that while C3AR1 can potentially be activated by diseases related to neuro-active ligand receptors such as thyroid hormone resistance syndrome (Cheng, 2005) and woolly hair (Shimomura , 2009, 2010), it is exclusively activated by the Complement and coagulation cascades pathway for the specific biological condition (H3N2/Wisconsin influenza infection) of the study.

Fig. 5.

(A) Estimated weights. Empirical memberships (weights) of genes in the Complement and coagulation cascades pathway estimated from applying the penalized functional linear regression analysis to all symptomatic subjects. 10 overlapping genes with the largest average estimated weights are shown in this figure. NA cells are due to genes being removed by the non-specific filtering criterion (IQR < 0.3). (B) Temporal expression pattern of C3AR1 (black circles/curve) from Subject 7. This gene belongs to two CP:KEGG pathways: the Neuroactive ligand-receptor interaction (red curve) and Complement and coagulation cascades (blue curve) (Color version of this figure is available at Bioinformatics online.)

4 Discussion

Time-series gene expression data have gained popularity in recent years due to their application in the translation studies. Identifying early changes in the gene-expression that are predictive of future responses to the diseases, infections or vaccinations can be predicted by applying advanced mathematical modeling tools such as high-dimensional differential equations (Lu ; Qiu ; Wu , 2014), dynamic Bayesian network (Perrin ; Zou and Conzen, 2005), or Granger’s model (Lozano ; Shojaie and Michailidis, 2010) to study the dynamic and causal relationship between genes based on changes of expressions over many time points. Unlike group comparisons, advanced inferential tools such as non-parametric regression (Müller, 2012) and functional data analysis are required to detect biological signals presented in the form of non-linear temporal trends of gene expression profiles most efficiently. Only a handful gene set analysis methods are designed to detect arbitrary non-linear temporal trend at this moment and there are much room for improvement. For example, TcGSA (Hejblum ) has an option to use either cubic polynomials or natural cubic splines to model non-linear temporal trends. However, TcGSA assumes the same trend among all genes in a set, which is unrealistic in many situations. PCA-maSigFun, developed by Nueda et al. (Nueda ), does have the ability to account for possible heterogeneity inside a gene set; but its use of standard PCA and linear regression is suboptimal for time-course data with complex non-linear patterns. On the other hand, CAMERA (Wu and Smyth, 2012) is a generic GSEA framework that can be used in time-course gene expression analysis as well. CAMERA adjusts for inter-gene correlation in the gene-set-level inference; is more efficient than procedures based on permuting samples; and is more flexible and robust than approaches based on estimating or approximating the correlation matrix (Nam, 2010; Wang , 2009a). That being said, CAMERA’s performance depends largely on the efficiency of the summary statistics used to capture temporal trends. In FUNNEL-GSEA, we use functional PCA to capture arbitrary and possibly heterogeneous non-linear trends within gene sets, which is known to be more efficient than competing methods such as splines (Sohn ; Storey ). Although our method is designed with a focus on subject-specific analyses; group-level results can be obtained by a suitable meta-analysis method. We choose to combine individual P-values by Fisher’s combined probability test in this study because it allows the detection of gene sets that are activated by influenza infection with subject-specific expression patterns, which is more flexible than those competing approaches that depend on detecting common expression patterns across all subjects. This approach also enables us to study the heterogeneity of subject-level responses, e.g. in Figures 4 and 5A. Although Fisher’s combined probability test is arguably the most widely used meta-analysis procedure, we will explore other meta-analysis options that have better protection against type I error in the future. GSEA facilitates comparisons across independent studies performed on different platforms and techniques by assembling gene-sets from available data-sets in the public repositories. The problem of overlapping gene-sets is exacerbated when these data are obtained under different experimental conditions. For example, C3AR1 is a protein coding gene that could be assigned to the Neuroactive ligand-receptor interaction and Complement and coagulation cascades pathway; yet within the context of influenza viral infection, empirical evidences show that it is almost entirely driven by the Complement and coagulation cascades pathway. To our best knowledge, no currently available GSEA methods has the ability to assign conditional pathway memberships to overlapping genes like C3AR1, and they simply count the summary statistics of overlapping genes in all gene sets to which they may belong in gene-set analyses. This naïve practice can inflate type I error if overlapping genes are signal-carrying genes assigned to irrelevant pathways, and/or reduce statistical power when many irrelevant null genes are assigned to an informative pathway. FUNNEL-GSEA assigns weights (empirical pathway membership) of overlapping genes based on penalized functional linear regression that reflects the functional similarities between overlapping genes and the gene sets they belong to. Such empirical memberships are more specific and relevant to the experimental conditions than generic associations provided by public databases. Furthermore, we require weights pertain to one gene to sum up to one; so that this gene is not over-counted in gene-set-level analyses. We also want to discuss the applicability of functional data analysis, which is designed for continuous data, to RNA-seq-based expression data (Garber ; Mortazavi ; Wang ). Unprocessed RNA-seq reads are discrete random variables that are commonly represented by the negative binomial model (Anders and Huber, 2010; Hardcastle and Kelly, 2010; Robinson ) or NBP model (Di ). Several recent studies (Law ; Rapaport ; Ritchie ) show that pre-processing techniques such as non-specific filtering, normalization, and log transformation can greatly reduce the granularity of raw reads so that analytic tools designed for continuous high-throughput data have comparable or even better performance on these data as compared with specialized tools based on discrete models. We believe this is largely due to the removing of genes with very low reads in modern gene expression analysis pipelines such as DESeq2 (Love ) and LIMMA (Smyth, 2005). The remaining genes have large numbers of reads that can be well approximated by continuous distributions such as the normal distribution (after log2 transformation). In the future, we plan to extend FUNNEL-GSEA so that it has an option to use summary statistics that are designed for discrete time-course models to summarize gene-level information; then use standard PCA and linear regression to assign weights for overlapping genes. Such an extension may be more suitable for un-normalized and un-filtered time-course RNA-seq data. Click here for additional data file.

71 in total

1. KEGG: kyoto encyclopedia of genes and genomes.

Authors: M Kanehisa; S Goto
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

Review 2. Thyroid hormone receptor mutations and disease: beyond thyroid hormone resistance.

Authors: Sheue-yann Cheng
Journal: Trends Endocrinol Metab Date: 2005 May-Jun Impact factor: 12.015

3. Correlation between gene expression levels and limitations of the empirical bayes methodology for finding differentially expressed genes.

Authors: Xing Qiu; Lev Klebanov; Andrei Yakovlev
Journal: Stat Appl Genet Mol Biol Date: 2005-11-22

4. Some comments on instability of false discovery rate estimation.

Authors: Xing Qiu; Andrei Yakovlev
Journal: J Bioinform Comput Biol Date: 2006-10 Impact factor: 1.122

Review 5. Computational methods for transcriptome annotation and quantification using RNA-seq.

Authors: Manuel Garber; Manfred G Grabherr; Mitchell Guttman; Cole Trapnell
Journal: Nat Methods Date: 2011-05-27 Impact factor: 28.547

6. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

7. De-correlating expression in gene-set analysis.

Authors: Dougu Nam
Journal: Bioinformatics Date: 2010-09-15 Impact factor: 6.937

8. Disruption of P2RY5, an orphan G protein-coupled receptor, underlies autosomal recessive woolly hair.

Authors: Yutaka Shimomura; Muhammad Wajid; Yoshiyuki Ishii; Lawrence Shapiro; Lynn Petukhova; Derek Gordon; Angela M Christiano
Journal: Nat Genet Date: 2008-02-24 Impact factor: 38.330

Review 9. RNA-Seq: a revolutionary tool for transcriptomics.

Authors: Zhong Wang; Mark Gerstein; Michael Snyder
Journal: Nat Rev Genet Date: 2009-01 Impact factor: 53.242

10. Acute myocardial infarction activates distinct inflammation and proliferation pathways in circulating monocytes, prior to recruitment, and identified through conserved transcriptional responses in mice and humans.

Authors: Neil Ruparelia; Jernej Godec; Regent Lee; Joshua T Chai; Erica Dall'Armellina; Debra McAndrew; Janet E Digby; J Colin Forfar; Bernard D Prendergast; Rajesh K Kharbanda; Adrian P Banning; Stefan Neubauer; Craig A Lygate; Keith M Channon; Nicholas W Haining; Robin P Choudhury
Journal: Eur Heart J Date: 2015-05-16 Impact factor: 29.983

9 in total

1. Bioinformatics identification of potential genes and pathways in preeclampsia based on functional gene set enrichment analyses.

Authors: Xue Li; Yanning Fang
Journal: Exp Ther Med Date: 2019-07-08 Impact factor: 2.447

2. FUNAGE-Pro: comprehensive web server for gene set enrichment analysis of prokaryotes.

Authors: Anne de Jong; Oscar P Kuipers; Jan Kok
Journal: Nucleic Acids Res Date: 2022-05-31 Impact factor: 19.160

3. Potential Genes and Pathways of Neonatal Sepsis Based on Functional Gene Set Enrichment Analyses.

Authors: YuXiu Meng; Xue Hong Cai; LiPei Wang
Journal: Comput Math Methods Med Date: 2018-07-30 Impact factor: 2.238

4. Highly efficient hypothesis testing methods for regression-type tests with correlated observations and heterogeneous variance structure.

Authors: Yun Zhang; Gautam Bandyopadhyay; David J Topham; Ann R Falsey; Xing Qiu
Journal: BMC Bioinformatics Date: 2019-04-15 Impact factor: 3.169

5. Aims, Study Design, and Enrollment Results From the Assessing Predictors of Infant Respiratory Syncytial Virus Effects and Severity Study.

Authors: Edward E Walsh; Thomas J Mariani; ChinYi Chu; Alex Grier; Steven R Gill; Xing Qiu; Lu Wang; Jeanne Holden-Wiltse; Anthony Corbett; Juilee Thakar; Matthew N McCall; David J Topham; Ann R Falsey; Mary T Caserta; Lauren Benoodt
Journal: JMIR Res Protoc Date: 2019-06-06