Liis Kolberg, Ivan Kuzmin, Priit Adler, Jaak Vilo, Hedi Peterson.
Abstract
BACKGROUND: A widely applied approach to extract knowledge from high-throughput genomic data is clustering of gene expression profiles followed by functional enrichment analysis. This type of analysis, when done manually, is highly subjective and has limited reproducibility. Moreover, this pipeline can be very time-consuming and resource-demanding as enrichment analysis is done for tens to hundreds of clusters at a time. Thus, the task often needs programming skills to form a pipeline of different software tools or R packages to enable an automated approach. Furthermore, visualising the results can be challenging.
Keywords: Data-driven; Functional enrichment analysis; Gene expression; Global visualisation; Hierarchical clustering; Microarray; Protoarray; RNA-seq; funcExplorer
Year: 2018 PMID: 30428831 PMCID: PMC6236982 DOI: 10.1186/s12864-018-5176-x
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Global overview of the Humoral dataset GSE11121 in the user interface of funcExplorer. (a) The main output shows a compact dendrogram with a heatmap alongside. Clusters, indicated as rectangles, are size-scaled and color-coded according to significant functions from enrichment analysis. (b) Additional information in the form of tables is shown in tabs ‘Summary’, ‘Unique annotations’ and ‘Genes’. (c) The search field allows searching for genes or functions of interest. The results are reported next to the field and are highlighted in the main view by dimming unrelated clusters. (d) The bottom of the page shows a report of the output (number of clusters, gene coverage) and the selected parameters (not shown in the figure). Clicking on a specific cluster or cluster-link redirects the user to a single cluster view. (e) Hovering over a cluster shows a tooltip with the most relevant information. (f) The word cloud in the single cluster view shows functional topics. (g) Expression profiles characterise the behavior of the cluster across samples. The results of this figure are available for more detailed exploration at https://biit.cs.ut.ee/funcexplorer/link/22ebb
Fig. 2 General pipeline of funcExplorer. Rectangles represent processes and cylinders represent data collections. The arrows indicate the direction of data flow, while dashed arrows denote the optional path. The input dataset is gene expression data in a standard tab-separated form. Data preparation and analysis are carried out automatically and lead to a user-friendly interactive visualisation in the web browser, available for self-discovery
Fig. 3 Compressing the dendrogram. The output of funcExplorer is compressed by showing only informative clusters (C1 (highlighted with a green rectangle), C4) with colorful size-scaled rectangles. The colors denote the domains of significantly enriched functions (p-value ≤ 0.05) in the cluster and are proportional to the number of annotations. The gray bar represents the proportion of annotated genes in the cluster. The enrichment scores used in the cluster detection algorithm are shown for every node below the cluster ID. If the ‘Show sparse clusters’ option is selected, the clusters with no significant enrichment are shown with beige rectangles (C5); otherwise they are completely collapsed from the output
Fig. 4 Properties of funcExplorer clusters in the example of CLEANsmall (left) and CLEANtotal (right) data analysis. (a, b) Cluster sizes after applying three cutting strategies to function-size-filtered (<700 genes) data. The number shows the total number of informative clusters detected. (c, d) The distribution of significant function sizes in the clusters with error bars (one standard deviation from the average)
Fig. 5 Tissue-specific clusters from GTEx data (I). Some examples of tissue-specific clusters detected by the funcExplorer best annotation strategy are highlighted in the figure. The results of this figure are available for more detailed exploration at https://biit.cs.ut.ee/funcexplorer/link/6afa1
Fig. 6 Tissue-specific clusters from GTEx data (II). A downloadable summary report of selected tissue-specific clusters detected by funcExplorer. U is an indicator of the presence of unique annotations in a cluster. The top functions of every domain are shown in the Domain Best Annotations table. Up to 20 top functions are represented in the topic word cloud. The eigengene profile is a representative of the expression levels of a given cluster across all the tissues. Clusters highly expressed in a specific tissue are significantly enriched with corresponding tissue-related functions. The results of this figure are available for more detailed exploration at https://biit.cs.ut.ee/funcexplorer/link/6afa1
Fig. 7 Fixed-cut at h=1.3 versus F1 strategy in CLEANsmall. The red rectangle on top of the dendrogram on the left highlights an example cluster obtained from the fixed-cut approach at a height of 1.3 (blue line). The F1 strategy (on the right) detected five smaller clusters from the same branch that come from various levels of tree height and cannot be detected together using a single cut. Each of these clusters is defined by a unique GO term, which indicates a functional difference between these gene groups. For example, cluster 1178 is enriched in the T-cell receptor signaling pathway, whereas cluster 967 is enriched in the B-cell receptor signaling pathway
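The fixed-cut approach mentioned in the Fig. 7 caption can be illustrated with a minimal sketch. The `Node` class, the toy tree and the gene names below are hypothetical and are not taken from funcExplorer; the F1 scoring itself is not described in this section and is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A dendrogram node (hypothetical structure, for illustration only)."""
    height: float                                   # merge height; 0.0 for a leaf
    genes: list = field(default_factory=list)       # gene IDs, set on leaves only
    children: list = field(default_factory=list)    # empty for a leaf


def leaves(node):
    """Collect all gene IDs under a node, left to right."""
    if not node.children:
        return list(node.genes)
    out = []
    for child in node.children:
        out.extend(leaves(child))
    return out


def fixed_cut(node, h):
    """Cut the tree at height h: return the maximal subtrees merged at or below h."""
    if node.height <= h or not node.children:
        return [leaves(node)]
    clusters = []
    for child in node.children:
        clusters.extend(fixed_cut(child, h))
    return clusters


# Toy dendrogram with five genes (made-up heights).
g1, g2, g3, g4, g5 = (Node(0.0, genes=[g]) for g in ["g1", "g2", "g3", "g4", "g5"])
ab = Node(0.4, children=[g1, g2])
abc = Node(1.2, children=[ab, g3])
de = Node(0.9, children=[g4, g5])
root = Node(2.0, children=[abc, de])

print(fixed_cut(root, 1.3))  # [['g1', 'g2', 'g3'], ['g4', 'g5']]
print(fixed_cut(root, 0.5))  # [['g1', 'g2'], ['g3'], ['g4'], ['g5']]
```

In this toy tree no single height yields {g1, g2, g3} together with g4 and g5 as separate clusters, because the first requires h ≥ 1.2 and the second requires h < 0.9. Selecting nodes individually, as the caption describes for the F1 strategy, is not bound by that restriction.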
Comparison of funcExplorer results with previous studies
| Dataset | Organism | n | Previous studies k | Previous studies C (%) | funcExplorer First k | funcExplorer First C (%) | funcExplorer First Rand(Adj.Rand) | funcExplorer First Filt. Rand(Adj.Rand) | funcExplorer Best k | funcExplorer Best C (%) | funcExplorer Best Rand(Adj.Rand) | funcExplorer Best Filt. Rand(Adj.Rand) | funcExplorer F1 k | funcExplorer F1 C (%) | funcExplorer F1 Rand(Adj.Rand) | funcExplorer F1 Filt. Rand(Adj.Rand) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLEANsmall [ | H. sapiens | 1,422 | 8 | 100% | 3 | 100% | 0.74(0.39) | 0.74(0.39) | | | | | 33 | 58.09% | 0.76(0.32) | 0.91(0.76) |
| Arabidopsis [ | A. thaliana | 4,982 | 8 | 74% | 7 | 80.08% | 0.86(0.47) | 0.94(0.81) | 8 | 79.45% | 0.88(0.50) | 0.96(0.86) | | | | |
| Humoral [ | H. sapiens | 2,579 | 11 | 46.5% | 4 | 98.68% | 0.66(0.11) | 0.66(0.12) | | | | | 36 | 30.52% | 0.55(0.10) | 0.80(0.39) |
| Yeast [ | S. cerevisiae | 1,340 | 6 | 100% | 2 | 100% | 0.52(0.07) | 0.52(0.07) | 21 | 49.85% | 0.68(0.18) | 0.78(0.29) | | | | |
*Note: n = number of genes in the data; k = number of clusters; C (%) = gene coverage (% of genes distributed into the k clusters); Rand = Rand index (value between 0 and 1); Adj.Rand = Adjusted Rand index (corrected-for-chance version of the Rand index; value between -1 and 1); Filt. Rand/Adj.Rand = indexes calculated after excluding unclustered genes; the values used in the further comparisons are highlighted in bold; funcExplorer parameters for CLEANsmall and Humoral: p-value ≤ 0.001; Arabidopsis: p-value ≤ 10⁻⁷; Yeast: p-value ≤ 0.01; no limit on term size
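The measures defined in the note can be sketched in plain Python. This is an illustration under stated assumptions, not the code used in the study: the function names and the toy labels are hypothetical, unclustered genes are encoded as `None`, and in the unfiltered indexes all unclustered genes are treated as one group, which is an assumption because the note does not say how they are handled.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(a, b):
    """Fraction of gene pairs that two clusterings treat the same way."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def adjusted_rand_index(a, b):
    """Rand index corrected for chance (Hubert-Arabie form)."""
    sum_cells = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(len(a), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:          # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def coverage(labels):
    """Gene coverage C (%): share of genes assigned to some cluster."""
    return 100 * sum(label is not None for label in labels) / len(labels)


def drop_unclustered(reference, labels):
    """Keep only genes that received a cluster, as in the 'Filt.' columns."""
    kept = [(r, l) for r, l in zip(reference, labels) if l is not None]
    return [r for r, _ in kept], [l for _, l in kept]


# Toy example: six genes, two of them left unclustered (None).
reference = ["A", "A", "A", "B", "B", "B"]
labels = [1, 1, None, 2, 2, None]

print(coverage(labels))                      # about 66.67
print(rand_index(reference, labels))         # about 0.667
ref_f, lab_f = drop_unclustered(reference, labels)
print(rand_index(ref_f, lab_f))              # 1.0
print(adjusted_rand_index(ref_f, lab_f))     # 1.0
```

The toy example mirrors the pattern in the table: when coverage is low, the filtered indexes can be considerably higher than the unfiltered ones, because disagreement caused by unclustered genes is removed.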
Comparison of funcExplorer results of CLEANsmall dataset
| CLEAN clusters[ | Corresponding funcExplorer cluster ID (size) | Top GO terms (BP, CC, MF) |
|---|---|---|
| 1. | ID: 117 (250 genes) | Cell cycle, macromolecular complex, nucleic acid binding |
| 2. Mitochondrion (125 genes) | | |
| 3. Mitosis, | ID: 63 (71 genes) | Mitotic cell cycle process, nuclear lumen, protein binding |
| 4. DNA replication, | | |
| 5. | ID: 5 (657 genes) | Multicellular organismal process, intrinsic component of plasma membrane, receptor activity |
| 6. | ID: 5 (657 genes) | Multicellular organismal process, intrinsic component of plasma membrane, receptor activity |
| | ID: 91 (77 genes) | Extracellular matrix organization, proteinaceous extracellular matrix |
| 7. | ID: 8 (170 genes) | Immune system process, plasma membrane, receptor activity |
| | ID: 305 (105 genes) | Immune system process, plasma membrane part, receptor activity |
| 8. | ID: 5 (657 genes) | Multicellular organismal process, intrinsic component of plasma membrane, receptor activity |
*Note: Results are from the best annotation strategy with p-value ≤ 0.001 and no upper limit for term size; cluster IDs are ordered by the enrichment score; cluster sizes are given in brackets; the characteristic functions of CLEAN clusters that also appear significant in the corresponding funcExplorer clusters are highlighted in bold; top functions are from GO biological process (BP), cellular component (CC) and molecular function (MF)
Fig. 8 Comparison of clusters from the Arabidopsis data. Seven clusters from the funcExplorer F1 strategy analysis match the clusters reported in the previous study [42]. The funcExplorer cluster IDs, with the equivalent cluster name from [42] in brackets, are shown on the left. The eigengene profiles and significant functions that describe the clusters are consistent with the previous analysis