| Literature DB >> 33286800 |
Kevin Schneider1, Benedikt Venn1, Timo Mühlhaus1.
Abstract
The objective of gene set enrichment analysis (GSEA) in modern biological studies is to identify functional profiles in huge sets of biomolecules generated by high-throughput measurements of genes, transcripts, metabolites, and proteins. GSEA is based on a two-stage process using classical statistical analysis to score the input data and subsequent testing for overrepresentation of the enrichment score within a given functional coherent set. However, enrichment scores computed by different methods are merely statistically motivated and often elusive to direct biological interpretation. Here, we propose a novel approach, called Thermodynamically Motivated Enrichment Analysis (TMEA), to account for the energy investment in biological relevant processes. Therefore, TMEA is based on surprisal analysis, which offers a thermodynamic-free energy-based representation of the biological steady state and of the biological change. The contribution of each biomolecule underlying the changes in free energy is used in a Monte Carlo resampling procedure resulting in a functional characterization directly coupled to the thermodynamic characterization of biological responses to system perturbations. To illustrate the utility of our method on real experimental data, we benchmark our approach on plant acclimation to high light and compare the performance of TMEA with the most frequently used method for GSEA.Entities:
Keywords: GSEA; acclimation response; free energy; gene set enrichment analysis; information theory; pathway analysis; surprisal analysis; thermodynamics; transcription levels
Year: 2020 PMID: 33286800 PMCID: PMC7597090 DOI: 10.3390/e22091030
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Figure 1Schematic overview of the Monte Carlo permutation testing procedure used in Thermodynamically Motivated Enrichment Analysis (TMEA). Left to right: For a functionally annotated set of size () in the original dataset, the size of the positively and negatively contributing subsets is determined (). Subsequently, random samples are resampled from the weight distribution of the original constraint yielded by surprisal analysis from either the positive or negative part respectively, to generate b bootstrapped samples of sizes . Then, these samples are aggregated to generate weight sums for positive and negative weights each. Then, the frequency distributions of these weight sums are used to report empirical p-values, which inform how likely it is to observe the given positive or negative weight sum for bin sizes in the original constraint by chance based on the values above () and below () the observed value.
Figure A2Constraint relevance. (A) Data reconstruction obtained by using (i) baseline state (constraint 0), (ii) Constraints 0–1, (iii) Constraints 0–2, and (iii) Constraints 0–3. (B) Singular values of constraints 1–10. The combination of a reconstruction efficiency of 98.6% and the singular value amplitude drop at α = 4 with no strong further decrease indicates a sufficient information supply by constraints 1–3.
Figure 2Contribution weights in constraints carry information beyond the count of extreme values. (A) R2 as a measure of linear regression quality of weight sum ratio (WR) by count ratio (CR) is shown in dependence of the quantile used to split the weight distributions of annotated sets to generate these ratios in either all positive (red) or all negative (blue) weight distributions for annotated subsets of constraints 1–3. The ±15% truncated mean of each is shown as a point of the same color. The quantile that separates weights in constraints 1–3 so that it produces the 15% truncated mean R2 regression quality shown in the upper part of the figure was used to split the weights in positively (56% quantile, (C)) and negatively (48% quantile, (B)) contributing parts of annotated subsets of constraints 1–3. Subsequently, both WR and CR were calculated for all the annotated subsets in the dataset. These values are shown as either red (right) or blue (left) points on the scatter plots. Linear regression was performed, and the resulting line was plotted with a 95% confidence band. These plots correspond to the regressions for a single y-value on the top plot. The existence and increase of outliers in the high weight/count ratio region suggests that high weight items carry an especially large amount of information that is lost when using traditional methods.
Figure 3TMEA reports different weight distribution shapes for annotated subsets as significant. Histograms with a bin width of 0.0015 of both negatively (left part of the plots, blue) and positively (right part of the plots, red) contributing entities in the functionally annotated sets of (A) protein.synthesis.ribosomal protein for Constraint 1, (B) signaling.light for Constraint 2, and (C) RNA.regulation of transcription.MYB-related transcription factor family protein for Constraint 3 are plotted together with the respective overall distribution of weights (gray area plot) of the respective sign and constraint.
Figure A1Maximal reachable α-level at a given α-level of 5%. The discrete nature of the hypergeometric distribution prevents the significance to reach 0.05 exactly. There always is a range of α-level space that must be sacrificed leading to a lower α than intended. The heatmap shows the maximal reachable α-level given: N = total number of genes = 10,000; K = number of differentially expressed genes; n = bin size; k = minimal number of differentially expressed genes needed for p-value < 0.05; intended α-level = 0.05. Especially when the bin size is low, even the half of the intended α-level often cannot be reached. Note that the bin size ranges from 1 to 500 in steps of 5.
Figure 4Comparison of significant functionally annotated sets (FASs) obtained by hypergeometric distribution (hypGSEA) and TMEA. (A) Venn diagrams of significant FASs with a minimum size of 5 reported by hypGSEA, TMEA constraints 1–3, and TMEA constraints 4–10 for a comparison without threshold. (B) Heatmap of adjusted p-values obtained by hypGSEA and TMEA. Measured transcripts were labeled with their respective hypGSEA p-value and the minimal TMEA p-value obtained within the first three constraints. All TMEA-significant bins are clustered by k-means clustering with k = 6. (C) Visualization of all FAS reported significant by hypGSEA and/or TMEA. Detailed cluster information is given in Table A1. Bins that are not reported by TMEA are appended to the end of the heatmap with increasing hypGSEA p-values.
Figure 5TMEA and surprisal analysis identify three major transcription patterns governing high light acclimation in Arabidopsis thaliana and provide a concise functional description for them. (A) Time course of the three major constraint potentials (λα for α = 1,2,3) indicate the importance of the respective transcription pattern. The potentials of the first three constraints (λ1–λ3) are shown for four days of acclimation and four days of subsequent de-acclimation. While λ1 separates the experiment in two major phases, λ2 and λ3 show more fluctuating patterns, defining three or four states, respectively. (B) Free energy landscapes defined by the three major state variables. Energy levels are plotted for transcription patterns (F1–F3), their sum (F123), and the total free energy when using all constraints for free energy calculation (Ftotal). The dominant pattern is responsible for two of the three visible local energy minima. The least weighted pattern of the three is responsible for an energy minimum at the end of the time course. (C) Selected FASs reported by TMEA with significant influences on the respective constraints are listed. Directional influence (+ for positive, − for inverse) on the respective pattern is indicated.
Figure 6The role of Anthocyanins during high light treatment: (A) Anthocyanin content in Arabidopsis thaliana under 4 days of high light treatment (days 0–4) and 4 days of de-acclimation at ambient light condition (days 4–8). (B) Weight distributions of transcripts included in secondary metabolism.flavonoids demonstrating significant influences for constraints 1–3. TMEA reports a significance for the weight sums of all three constraints.
Figure 7Phenylalanine time course. Phenylalanine fold changes during 4 days of high light acclimation and 4 days of de-acclimation under ambient conditions show increased abundance 3 h to 1 day after condition change.
Significant FASs reported by TMEA in constraints 1–3. The p-values were clustered using the k-means clustering algorithm with a cluster number of 6 (cluster ID 1–6). The corresponding heatmap is depicted in Figure 4.
| cID | MapMan Annotation (FAS) | cID | MapMan Annotation (FAS) |
|---|---|---|---|
| 1 | cell wall.cell wall proteins | 4 | major CHO metabolism |
| 1 | cell wall.cell wall proteins.AGPs | 4 | major CHO metabolism.degradation |
| 1 | cell wall.cell wall proteins.AGPs.AGP | 4 | major CHO metabolism.degradation.starch |
| 1 | cell wall.pectin*esterases.misc | 4 | misc.invertase/pectin methylesterase inhibitor family protein |
| 1 | lipid metabolism.FA desaturation | 4 | not assigned.no ontology.DC1 domain containing protein |
| 1 | lipid metabolism.FA desaturation.desaturase | 4 | not assigned.unknown |
| 1 | misc.beta 1,3 glucan hydrolases | 4 | RNA.regulation of transcription.AP2/EREBP, APETALA2/Ethylene-responsive element binding protein family |
| 1 | misc.beta 1,3 glucan hydrolases.glucan endo-1,3-beta-glucosidase | 4 | RNA.regulation of transcription.C2C2(Zn) CO-like, Constans-like zinc finger family |
| 1 | misc.glutathione S transferases | 4 | RNA.regulation of transcription.C2C2(Zn) DOF zinc finger family |
| 1 | misc.nitrilases, *nitrile lyases, berberine bridge enzymes, reticuline oxidases, troponine reductases | 4 | RNA.regulation of transcription.MYB-related transcription factor family |
| 1 | misc.O-methyl transferases | 4 | RNA.regulation of transcription.Psudo ARR transcription factor family |
| 1 | misc.protease inhibitor/seed storage/lipid transfer protein (LTP) family protein | 4 | secondary metabolism.isoprenoids.terpenoids |
| 1 | nucleotide metabolism.synthesis.purine | 4 | secondary metabolism.phenylpropanoids.lignin biosynthesis |
| 1 | protein | 4 | stress.abiotic |
| 1 | protein.degradation.AAA type | 4 | stress.abiotic.cold |
| 1 | protein.synthesis | 4 | stress.biotic.respiratory burst |
| 1 | protein.synthesis.ribosomal protein | 4 | transport.sulfate |
| 1 | protein.synthesis.ribosomal protein.eukaryotic | 5 | cell wall |
| 1 | protein.synthesis.ribosomal protein.eukaryotic.40S subunit | 5 | cell wall.modification |
| 1 | protein.synthesis.ribosomal protein.eukaryotic.60S subunit | 5 | misc |
| 1 | protein.synthesis.ribosomal protein.prokaryotic.chloroplast | 5 | secondary metabolism |
| 1 | protein.synthesis.ribosomal protein.prokaryotic.chloroplast.50S subunit | 5 | secondary metabolism.flavonoids |
| 1 | protein.synthesis.ribosome biogenesis | 5 | secondary metabolism.flavonoids.anthocyanins |
| 1 | protein.synthesis.ribosome biogenesis.Pre-rRNA processing and modifications | 5 | secondary metabolism.flavonoids.anthocyanins.anthocyanin 5-aromatic acyltransferase |
| 1 | protein.synthesis.ribosome biogenesis.Pre-rRNA processing and modifications.snoRNPs | 5 | secondary metabolism.flavonoids.dihydroflavonols |
| 1 | protein.synthesis.ribosome biogenesis.Pre-rRNA processing and modifications.WD-repeat proteins | 5 | stress |
| 1 | redox.glutaredoxins | 5 | stress.biotic |
| 1 | RNA.regulation of transcription.ARR | 5 | transport |
| 1 | RNA.regulation of transcription.NAC domain transcription factor family | 6 | cell wall.degradation |
| 1 | RNA.regulation of transcription.WRKY domain transcription factor family | 6 | cell wall.degradation.mannan-xylose-arabinose-fucose |
| 1 | secondary metabolism.simple phenols | 6 | DNA.synthesis/chromatin structure.retrotransposon/transposase |
| 1 | signaling | 6 | DNA.synthesis/chromatin structure.retrotransposon/transposase.gypsy-like retrotransposon |
| 1 | signaling.in sugar and nutrient physiology | 6 | hormone metabolism |
| 1 | signaling.receptor kinases.DUF 26 | 6 | hormone metabolism.auxin |
| 1 | signaling.receptor kinases.misc | 6 | minor CHO metabolism |
| 1 | signaling.receptor kinases.wall associated kinase | 6 | minor CHO metabolism.trehalose |
| 1 | signaling.receptor kinases.wheat LRK10 like | 6 | minor CHO metabolism.trehalose.potential TPS/TPP |
| 1 | stress.biotic.PR-proteins.plant defensins | 6 | misc.gluco-, galacto- and mannosidases |
| 1 | transport.Major Intrinsic Proteins | 6 | not assigned.no ontology.glycine rich proteins |
| 2 | amino acid metabolism.synthesis | 6 | not assigned.no ontology.pentatricopeptide (PPR) repeat-containing protein |
| 2 | amino acid metabolism.synthesis.aspartate family | 6 | PS.lightreaction |
| 2 | development.storage proteins | 6 | PS.lightreaction.photosystem II |
| 2 | hormone metabolism.auxin.induced-regulated-responsive-activated | 6 | PS.lightreaction.photosystem II.LHC-II |
| 2 | nucleotide metabolism.synthesis | 6 | secondary metabolism.flavonoids.chalcones |
| 2 | protein.synthesis.ribosomal protein.eukaryotic.60S subunit.L7A | 6 | secondary metabolism.flavonoids.flavonols |
| 2 | protein.synthesis.ribosomal protein.prokaryotic | 6 | secondary metabolism.phenylpropanoids |
| 2 | signaling.calcium | 6 | signaling.light |
| 2 | stress.biotic.receptors | 6 | transport.ABC transporters and multidrug resistance systems |
| 2 | transport.Major Intrinsic Proteins.PIP | 6 | transport.sugars |
| 3 | misc.cytochrome P450 | ||
| 3 | misc.GDSL-motif lipase | ||
| 3 | misc.peroxidases | ||
| 3 | signaling.receptor kinases | ||
| 3 | stress.biotic.PR-proteins |