Samarendra Das, Craig J McClain, Shesh N Rai.
Abstract
Over the last decade, gene set analysis has become the first choice for gaining insights into the underlying complex biology of diseases through gene expression and gene association studies. It also reduces the complexity of statistical analysis and enhances the explanatory power of the obtained results. Although gene set analysis approaches are extensively used in gene expression and genome-wide association data analysis, the statistical structure and steps common to these approaches have not yet been comprehensively discussed, which limits their utility. In this article, we provide a comprehensive overview of the statistical structure and steps of gene set analysis approaches used for microarray, RNA-sequencing and genome-wide association data analysis. Further, we classify the gene set analysis approaches and tools by the type of genomic study, null hypothesis, sampling model and nature of the test statistic, etc. Rather than reviewing the gene set analysis approaches individually, we describe the generation-wise evolution of such approaches for microarray, RNA-sequencing and genome-wide association studies and discuss their relative merits and limitations. Here, we identify the key biological and statistical challenges in current gene set analysis, which will be addressed by statisticians and biologists collectively in order to develop the next generation of gene set analysis approaches. Further, this study will serve as a catalog and provide guidelines to genome researchers and experimental biologists for choosing the proper gene set analysis approach based on several factors.
Keywords: RNA-sequencing; competitive; gene set analysis; genome wide association study; microarrays; null hypothesis; sampling model; self-contained
Year: 2020 PMID: 33286201 PMCID: PMC7516904 DOI: 10.3390/e22040427
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Figure 1. Outlines and classification of gene set analysis approaches. (A): Outlines of gene set analysis approaches; (B): Classification of gene set analysis approaches for high-throughput sequencing studies.
Figure 2. Classification of gene set analysis approaches and tools available for microarrays. Schematic representation of the breakup of GSA methods available for microarray data analysis based on statistical tests (i.e., null hypothesis, test statistic(s)) and requirement of annotation databases. G: Gene set; * Tools require normalization of data prior to application.
Generation-wise evolution of GSA approaches for microarray studies.
| Approach | Methodology | Advantages | Limitations | Tools/Algorithms |
|---|---|---|---|---|
| | Hypergeometric distribution/Fisher’s test | Easy to execute. Assigns an easily interpretable measure, such as a p-value, to the whole gene set. | Highly dependent on the threshold/cutoff value, which is at the user’s discretion and hard to determine. Test statistic is independent of the genes’ differential expression scores. Uses only the most significant genes under a hard threshold and discards the others, leading to information loss. Assumes each gene contributes equally to the phenotype/trait. Treats each gene as independent and ignores correlation or redundancy among genes in the gene set. Assumes each predefined gene set is independent of the others, which is erroneous. | DAVID [ |
| | Wilcoxon signed rank test, Sum, Mean, or Median of gene-level statistic(s), Wilcoxon signed rank sum, Max-Mean Statistic | Does not require a threshold/cutoff value for dividing the gene space into selected and non-selected parts. Considers dependence among genes in the gene set. Test statistic is based on the differential GE scores of genes in the gene set. | Analyzes each gene set independently. Considers only the number of genes in a gene set (pathway) when performing GSA and ignores the additional information available from the bio-knowledge bases. Assumes the predefined gene sets are mutually exclusive, although in biology these gene sets overlap. Most ESA methods use differential GE to rank genes/compute the test statistic but discard this information from further analysis. | GSEA [ |
| | | Considers both a gene’s relations/dependencies with other genes and changes in experimental condition. Considers the topology of the pathways/gene sets in modeling. | Dependent on the cell type (owing to cell-specific GE profiles) and the condition being studied, information that is rarely available. Less popular because it requires rarely available information and is computationally intensive. Unable to consider interactions between gene sets (pathways). Heavily dependent on annotations. | PathwayExpress [ |
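The threshold dependence listed for the hypergeometric/Fisher-type approach can be made concrete with a minimal sketch. The function names (`ora_p_value`, `ora_for_cutoff`), the gene labels and the gene-level p-values below are all hypothetical illustrations, not taken from any tool in the table; the test shown is the standard upper-tail hypergeometric probability.

```python
from math import comb

def ora_p_value(n_universe, n_in_set, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value P(X >= n_overlap).

    X is the number of gene-set members among n_selected genes drawn
    without replacement from a universe of n_universe genes, of which
    n_in_set belong to the gene set.
    """
    max_overlap = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical gene-level p-values (toy data, for illustration only).
gene_p = {"g1": 0.001, "g2": 0.020, "g3": 0.049, "g4": 0.051, "g5": 0.300,
          "g6": 0.004, "g7": 0.700, "g8": 0.450, "g9": 0.900, "g10": 0.600}
gene_set = {"g1", "g2", "g3", "g4"}  # hypothetical predefined gene set

def ora_for_cutoff(cutoff):
    """Select genes with p < cutoff, then test the gene set for enrichment."""
    selected = {g for g, p in gene_p.items() if p < cutoff}
    overlap = len(selected & gene_set)
    return ora_p_value(len(gene_p), len(gene_set), len(selected), overlap)

print(ora_for_cutoff(0.05))  # 25/210, about 0.119
print(ora_for_cutoff(0.06))  # 6/252, about 0.024
```

In this toy example, moving the cutoff from 0.05 to 0.06 admits one extra gene (`g4`, p = 0.051) and changes the gene-set p-value from about 0.119 to about 0.024, which is the kind of sensitivity the limitations column describes.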
Figure 3. Classification of gene set analysis approaches and tools available for RNA-seq data analysis. Schematic representation of the breakup of GSA methods available for RNA-seq data analysis based on statistical tests (i.e., null hypothesis, test statistic(s)) and requirement of annotation databases. The first level of branching separates GSA methods adapted from microarray practice to fit RNA-seq data from those specifically designed for RNA-seq. Subsequent branching depends on the different null hypotheses they test. G: Gene set. * Tools require normalization of data prior to application.
Generation-wise evolution of GSA approaches for RNA-sequencing studies.
| Approach | Methodology | Advantages | Limitations | Tools |
|---|---|---|---|---|
| | Hypergeometric distribution, Fisher’s exact test | Simple to use. Assigns an easily interpretable measure, such as p-values. Less time consuming when interpreting huge RNA-seq data. | Uses a hard threshold approach to select gene sets. Treats each transcript as independent and ignores correlation or gene-gene interaction. Mostly dependent on annotation bases, but RNA-seq transcripts are not well annotated. | GoSeq [ |
| | Wilcoxon signed rank test, Max-Mean Statistic | Does not require a threshold for dividing the gene space into selected and non-selected parts. Considers dependence among genes in the gene set. | Uses normalization to obtain microarray-like data, so the count nature of RNA-seq data is lost. Data transformation discards dispersion and other inherent properties of RNA-seq data. ES-based tools/algorithms use the differential score to prepare the ranked transcript list but ignore this information in gene set testing. GSEA-based tools such as seqGSEA are computationally intensive and time consuming, and offer only a single gene set-level statistic. GSVA is not designed for gene set-based differential expression analysis between two phenotypically distinct sample groups. ES-based GSA approaches do not account for the inherent zero inflation in RNA-seq data. | AbsFilterGSEA [ |
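As an illustration of a threshold-free gene set statistic, the sketch below computes a max-mean statistic from gene-level scores. It follows the commonly described definition (average the positive parts and the negative parts over the whole set, then keep the larger of the two); the function name `max_mean`, the signed return convention and the example scores are assumptions made for this sketch, not the implementation of any tool listed in the table.

```python
def max_mean(scores):
    """Max-mean statistic of gene-level scores for one gene set.

    Positive and negative parts are each averaged over ALL genes in the
    set (not only over the genes of that sign). The larger average wins;
    returning it with a sign is an assumption of this sketch.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("gene set has no scores")
    pos_mean = sum(s for s in scores if s > 0) / n   # mean of max(s, 0)
    neg_mean = sum(-s for s in scores if s < 0) / n  # mean of -min(s, 0)
    return pos_mean if pos_mean >= neg_mean else -neg_mean

# Hypothetical gene-level scores (e.g. t-like statistics) for two gene sets.
mostly_up = [2.0, 1.0, -0.5, 0.0]
mostly_down = [-3.0, 1.0]

print(max_mean(mostly_up))    # 0.75
print(max_mean(mostly_down))  # -1.5
```

Every gene in the set contributes to the statistic, so no cutoff separating selected from non-selected genes is needed; significance would still have to be assessed against a null distribution, which this sketch does not cover.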
Figure 4. Classification of gene set analysis approaches and tools available for SNP data analysis. Schematic representation of the breakup of GSA methods available for SNP data analysis based on statistical tests and requirement of annotation databases. The first level of branching separates GSA methods adapted from microarrays to fit SNP data from those specifically designed for SNP data analysis. Subsequent branching depends on the different null hypotheses they test (i.e., null hypothesis, test statistic(s)). G: Gene set.
Generation-wise evolution of GWAS GSA approaches for SNP data analysis.
| Approach | Methodology | Advantages | Limitations | Tools/Algorithm |
|---|---|---|---|---|
| | Hypergeometric distribution, Fisher’s exact test, Binomial test | Simple to use and easy to interpret. Assigns a statistically convincing measure, such as a p-value, to the SNP set, which is biologically meaningful. Computationally inexpensive. | A hard (arbitrary) threshold divides the SNP list into selected and non-selected SNP sets; for instance, with a p-value threshold of 0.05, a SNP with a value of 0.051 is not included in the SNP list. Uses only the most significant SNPs and discards the others, leading to information loss. Test statistic is independent of the SNP data (based only on the SNP count) and ignores the strength of association. Treats each SNP as independent and ignores linkage disequilibrium. Assumes each SNP contributes equally, which is not true because there are common and rare variants. Dependent on a predefined bio-knowledge base, which is mostly incomplete or unavailable. | SNPtoGO [ |
| | Wilcoxon signed rank test, Sum test, Weighted Sum test | Does not require a hard threshold for dividing the SNP list into selected and non-selected parts. Jointly considers multiple contributing factors in the same gene set, which may complement the most-significant SNPs/genes approach. Test statistic is computed from the SNP data, taking linkage disequilibrium into account. | Analyzes each gene set independently. Uses the data only for selecting SNPs and then ignores the data in gene-set testing. Treats all genes in a gene set independently and does not account for the relationships between genes. | GSA-SNP [ |
| | Graph/Network theory | Relationships between genes are used to assign different levels of “importance” to genes in the set. Helps integrate gene set membership information with interaction data from a separate source. | Difficult to generalize. True topology depends on the cell type and experimental condition, which are rarely available. Cannot model the dynamics of the cellular system. Heavily dependent on annotations, which are either missing or incomplete. | dmGWAS [ |
| | Linear regression model, Ridge regression, Logistic regression, Linear models | Considers both SNP and gene set information simultaneously in the same model. Jointly considers linkage disequilibrium and gene-gene interaction in the gene set for modeling. Future behavior of the system can be predicted. The dynamics of the biological system can also be modeled and studied. | Computationally intensive. High dimensionality of genomic data raises serious concerns. Ignores the non-linear interactions among biomolecules. | LRpath [ |
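A sum-type statistic for a SNP set can be sketched as follows. This is a deliberately simplified illustration: the set statistic is the sum of SNP-level association statistics, and its null distribution is built by drawing random SNP sets of the same size (a SNP-sampling null). That null ignores linkage disequilibrium, which is one of the limitations noted in the table; the function name `sum_test_p_value`, the SNP labels and the statistics are hypothetical.

```python
import random

def sum_test_p_value(stats, set_ids, n_perm=2000, seed=0):
    """Empirical p-value for the sum of SNP-level statistics in a set.

    stats   : dict mapping SNP id -> association statistic (larger = stronger)
    set_ids : SNP ids mapped to the gene set under test
    The null draws random SNP sets of equal size from all SNPs, so it
    does NOT preserve linkage disequilibrium (an assumption of this sketch).
    """
    rng = random.Random(seed)
    all_ids = list(stats)
    k = len(set_ids)
    observed = sum(stats[i] for i in set_ids)
    hits = 0
    for _ in range(n_perm):
        draw = rng.sample(all_ids, k)
        if sum(stats[i] for i in draw) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Hypothetical SNP-level statistics (integers, to keep sums exact).
snp_stats = {f"snp{i}": i for i in range(50)}
strong_set = [f"snp{i}" for i in range(45, 50)]  # strongest five SNPs
weak_set = [f"snp{i}" for i in range(5)]         # weakest five SNPs

print(sum_test_p_value(snp_stats, strong_set))
print(sum_test_p_value(snp_stats, weak_set))
```

Unlike the threshold-based row of the table, every SNP in the set contributes in proportion to its statistic. Preserving linkage disequilibrium would require a different null, for example permuting phenotype labels and recomputing the SNP-level statistics, which needs genotype data and is outside this sketch.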