
Exploiting plant transcriptomic databases: Resources, tools, and approaches.

Peng Ken Lim¹, Xinghai Zheng¹, Jong Ching Goh¹, Marek Mutwil².

Abstract

There are now more than 300 000 RNA sequencing samples available, stemming from thousands of experiments capturing gene expression in organs, tissues, developmental stages, and experimental treatments for hundreds of plant species. The expression data have great value, as they can be re-analyzed by others to ask and answer questions that go beyond the aims of the study that generated the data. Because gene expression provides essential clues to where and when a gene is active, the data provide powerful tools for predicting gene function, and comparative analyses allow us to study plant evolution from a new perspective. This review describes how we can gain new knowledge from gene expression profiles, expression specificities, co-expression networks, differential gene expression, and experiment correlation. We also introduce and demonstrate databases that provide user-friendly access to these tools.


Keywords: co-expression; comparative transcriptomics; databases; differential expression; gene expression; gene function


Year: 2022; PMID: 35605200; PMCID: PMC9284291; DOI: 10.1016/j.xplc.2022.100323

Source DB: PubMed; Journal: Plant Commun; ISSN: 2590-3462

Introduction

Plants are a vital source of food and feed and have tremendous ecological significance because of their role as primary producers in the terrestrial ecosystem (Friend, 2010). Furthermore, they are a valuable library of medicinal secondary metabolites that can be adapted for therapeutic use (Hussain et al., 2012; Rao and Ravishankar, 2002). Because plants are so central to our survival and well-being, we need a greater understanding of plant genes, which will allow us to improve desirable traits (Nowicka et al., 2018; Niazian, 2019). For example, the study of plant genes involved in secondary metabolism can elucidate biosynthetic pathways of medicinal compounds, which can be transferred to suitable expression hosts for large-scale production and modification (Paddon et al., 2013; Cravens et al., 2019; Sabzehzari et al., 2020). However, the characterization of plant gene function is a slow process, typically limited to a handful of model plants. This is mainly because of the high redundancy, complexity, and large size of plant genomes, lack of genetic transformation options, long generation time, and difficulty in cultivation (Kondou et al., 2010; Bolle et al., 2013; Xuan et al., 2021). In addition, many novel genes discovered in non-model plants cannot be functionally annotated by sequence homology because they show no sequence similarity to characterized genes (Hamilton and Buell, 2012). Furthermore, functional annotation of plant genes with characterized homologs can still be difficult if they belong to large and functionally diverse gene families (He and Zhang, 2005). Consequently, many bioinformatic approaches have been developed that use genomic, transcriptomic, metabolomic, and phenotypic data to address the difficulties in gene function prediction (Radivojac et al., 2013; Rhee and Mutwil, 2014; Hansen et al., 2018).
Most gene function prediction approaches are based on the principle of guilt by association, which states that genes with similar characteristics (e.g., sequence or expression) are likely to have the same function (Rhee and Mutwil, 2014). For example, two genes that show similar gene expression profiles across different organs and tissues are likely to be part of the same protein complex (Persson et al., 2005) or biosynthetic pathway (Delli-Ponti et al., 2021) or to be targeted to the same subcellular compartment (Ryngajllo et al., 2011). As publicly available gene expression data are growing at a near-exponential rate, many online databases with diverse functionalities and plant species have emerged as valuable contributions to the scientific community. Here, we provide a guide to the various concepts and databases that can be used for gene function prediction and as electronic evidence to support experimental results (Figure 1; Table 1). Because of the large number of online gene expression databases, we cover only those that offer unique approaches for studying gene function.

Figure 1

Data mining workflows using the different onboard functions/tools of plant transcriptomic databases.

Colored edges connect different aims and databases to one type of function/tool.

Table 1

Summary of the featured databases

Databases	Notable onboard functionalities/statistical methods
Gene expression profile analyses

ePlant by BAR	gene expression profiles as eFP heatmaps; each eFP is generated from gene expression data from a few high-quality studies; scRNA-seq gene expression data for A. thaliana roots
Expression Atlas	gene expression profiles as heatmaps; scRNA-seq gene expression data are available for A. thaliana, Oryza sativa, S. lycopersicum, and Zea mays
GENEVESTIGATOR	gene expression profiles in the form of boxplots, scatterplots, or heatmaps; conditions in gene expression profiles are amalgamations of similar samples across multiple studies, with the exception of perturbation conditions; enables gene expression profiles to be generated from user-defined datasets
CoNekT-Plants	gene expression profiles as bar charts and heatmaps; conditions in gene expression profiles are amalgamations of similar samples across multiple studies; gene expression profiles for most plants predominantly feature only anatomical conditions; cross-species comparative gene expression profile analyses available between user-defined gene sets or within gene families

Co-expression analyses

Expression Angler by BAR	hosts thematic co-expression data from well-curated samples of relevant studies; uses PCC as a co-expression metric
ATTED-II	hosts comprehensive co-expression data amalgamated from both RNA-seq and microarray samples; hosts thematic co-expression data; uses MR as a co-expression metric; enables network visualization of gene co-expression neighborhoods
CoNekT-Plants	hosts comprehensive co-expression data from RNA-seq samples; uses HRR as a co-expression metric; enables GO enrichment analyses on co-expression clusters; enables comparative network analyses between species based on gene family information; enables network visualization of gene co-expression neighborhoods and clusters

DGE analyses

AtCAST	enables users to search for control/treatment DGE comparisons based on single- or multi-gene query
Expression Atlas	enables users to search for control/treatment DGE comparisons based on single- or multi-gene query; features onboard functionalities to extract DEGs from DGE comparisons based on user-defined fold change and p value; control and treatment samples in comparisons are predefined
GENEVESTIGATOR	enables users to search for control/treatment DGE comparisons based on single- or multi-gene query; features onboard functionalities to extract DEGs from DGE comparisons based on user-defined fold change and p value; control and treatment samples in comparisons are user defined

Gene expression-specificity/stability analyses

Rice Expression Database	enables housekeeping and specific genes to be identified based on the stability of expression across different conditions; stability is calculated using τ value
CoNekT-Plants	features tool to identify specific genes in tissue of choice; specificity of gene expression in tissue is calculated from SPM, Tau, and entropy; enables cross- or within-species comparison of different tissue-specific gene sets based on gene family information
GENEVESTIGATOR	features tool to identify housekeeping genes based on the stability of expression across a set of user-defined conditions; stability is calculated based on expression variance score

Experiment correlation analyses

AtCAST	enables users to search for correlated experiments based on an experimental condition or user-uploaded gene expression data; relationships among experiments can be visualized as a network

Additional information

AtCAST (Sasaki et al., 2011)	hosts species data for A. thaliana only; database: http://atpbsmd.yokohama-cu.ac.jp/cgi/atcast/home.cgi
ATTED-II (Obayashi et al., 2009)	hosts species data for A. thaliana, Brassica rapa, G. max, M. truncatula, O. sativa, V. vinifera, S. lycopersicum, Z. mays, and Populus trichocarpa; database: https://atted.jp/
CoNekT-Plants (Proost and Mutwil, 2018)	hosts species data for A. trichopoda, A. thaliana, Chlamydomonas reinhardtii, Cyanophora paradoxa, G. biloba, Marchantia polymorpha, O. sativa, Physcomitrella patens, Picea abies, Selaginella moellendorffii, V. vinifera, S. lycopersicum, and Z. mays; database: https://evorepro.sbs.ntu.edu.sg/
ePlant by BAR (Waese et al., 2017)	hosts species data for A. thaliana, Cannabis sativa, Eucalyptus grandis, G. max, Helianthus annuus, Hordeum vulgare, P. trichocarpa, M. truncatula, O. sativa, S. lycopersicum, Saccharum officinarum, Sarracenia purpurea, Solanum tuberosum, Triticum aestivum, and Z. mays; database: http://bar.utoronto.ca/
Expression Angler by BAR (Toufighi et al., 2005)	hosts species data for A. thaliana and P. trichocarpa; database: http://bar.utoronto.ca/
Expression Atlas (Kapushesky et al., 2010)	hosts species data for A. thaliana, Brachypodium distachyon, G. max, H. vulgare, M. truncatula, O. sativa, Sorghum bicolor, S. lycopersicum, S. tuberosum, T. aestivum, V. vinifera, and Z. mays; database: https://www.ebi.ac.uk/gxa/home
GENEVESTIGATOR (Hruz et al., 2008)	hosts species data for A. thaliana, Brassica oleracea, G. max, H. vulgare, M. truncatula, Nicotiana tabacum, O. sativa, P. patens, S. bicolor, S. lycopersicum, T. aestivum, and Z. mays; client download: https://genevestigator.com/gv/start/start.jsp
Rice Expression Database (Xia et al., 2017)	hosts species data for O. sativa only; database: http://expression.ic4r.org/

DGE, differential gene expression; eFP, electronic fluorescent pictograph; PCC, Pearson’s correlation coefficient; MR, mutual rank; HRR, highest reciprocal rank; GO, Gene Ontology; DEGs, differentially expressed genes; SPM, specificity measure.
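
To make the specificity measures in Table 1 more concrete, the Tau index is commonly defined as τ = Σ(1 − x_i/x_max)/(n − 1) over the n conditions of a gene's expression profile, so that 0 indicates uniform expression and 1 indicates expression in a single condition. The following is a minimal sketch under that definition only; the function name and example values are hypothetical, and the featured databases may preprocess expression values (e.g., log transformation) before computing their scores.

```python
def tau_index(profile):
    """Tau specificity index of one gene across conditions.

    profile: list of non-negative expression values, one per condition.
    Returns a value between 0 (uniform expression) and 1 (one condition only).
    """
    n = len(profile)
    if n < 2:
        raise ValueError("need at least two conditions")
    x_max = max(profile)
    if x_max <= 0:
        # Assumption: a gene with no detectable expression is reported as 0.
        return 0.0
    return sum(1 - x / x_max for x in profile) / (n - 1)


if __name__ == "__main__":
    # Hypothetical profiles, for illustration only.
    print(tau_index([5, 5, 5, 5]))   # housekeeping-like profile
    print(tau_index([0, 0, 10, 0]))  # condition-specific profile
```

A high Tau value suggests a candidate tissue-specific gene, whereas a low value suggests stable, housekeeping-like expression.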


Studying gene function with gene expression profiles

Because most genes are expected to be expressed only when needed, gene expression profiles are useful in predicting gene function. For example, a gene expressed specifically in pollen is likely to be important for pollen function (Bernal et al., 2008), whereas another gene strongly expressed during heat stress is likely to be necessary for survival during heat. Numerous studies and databases comprise customized expression compendia for answering different questions (Table 1). For example, GENEVESTIGATOR’s anatomy compendium contains 10 562 samples from 127 different anatomical parts of Arabidopsis thaliana (Hruz et al., 2008), showing the average expression values of each gene in diverse organs, tissues, and treatments (Table 1). The expression data in the ePlant database are also arranged into multiple compendia that capture various (a)biotic stresses. To demonstrate how gene expression profile analyses can reveal gene function, we will use Agamous (AG, AT4G18960), a floral homeotic gene that specifies floral meristem and carpel and stamen identity (Yanofsky et al., 1990). The ePlant database (https://bar.utoronto.ca/eplant/) (Waese et al., 2017) showcases expression patterns of a single gene as electronic fluorescent pictographs. Each gene expression profile generated is based on a high-quality transcriptome dataset that has been thoughtfully selected from one or two studies. The Developmental Map of AG shows strong expression in stamens and carpels at flowering stages 12 and 15 of A. thaliana, which is in line with the function of AG (Figure 2A). The advantage of this expression visualization is that the pictures provide an easy-to-understand anatomical context for the different samples. On the other hand, boxplots and bar charts are typically used to indicate the average/median expression of a single or a few genes over multiple samples. 
CoNekT-Plants (https://conekt.sbs.ntu.edu.sg/) (Proost and Mutwil, 2018) uses bar charts to indicate the mean expression, and the points above and below the bar indicate the maximum and minimum expression in a given organ, respectively (Figure 2B). The boxplots in GENEVESTIGATOR depict the mean expression, and whiskers indicate the distribution of the data with standard error (Figure 2C). Heatmaps are two-dimensional (or even three-dimensional; Fernandez-Pozo et al., 2020) grids suited to visualizing gene expression of multiple genes and conditions simultaneously. Typically, genes are arranged in rows, and organs, tissues, cell types, and treatments are shown in columns. The cells in the heatmap usually indicate the average expression of a gene in a given condition through the use of colors, numbers, or both. Although heatmaps cannot directly show the distribution of the data as bar charts and boxplots can, they enable the plotting of dense, data-rich figures. For example, the AG expression profile in the GENEVESTIGATOR Condition Search in the Anatomy function uses a heatmap to show its expression profile in 127 different anatomical parts (Figure 2D). In addition, the tool shows the number of samples, and the users can also click on a specific organ, which will give more details on the experiments from which the sample data were obtained.
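
The per-organ summaries behind such bar charts can be sketched as follows. This is an illustrative example only, not the code used by any of the databases: it groups sample-level values by condition and reports the mean, minimum, and maximum, which are the quantities shown in the CoNekT-Plants bar charts. The function name and the expression values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_profile(samples):
    """Summarize one gene's expression per condition.

    samples: iterable of (condition, expression_value) pairs, one per sample.
    Returns {condition: {"mean", "min", "max", "n"}}.
    """
    grouped = defaultdict(list)
    for condition, value in samples:
        grouped[condition].append(value)
    return {
        condition: {
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            "n": len(values),
        }
        for condition, values in grouped.items()
    }


if __name__ == "__main__":
    # Hypothetical TPM values for a flower-specific gene.
    demo = [("flower", 42.0), ("flower", 58.0), ("leaf", 0.4), ("root", 0.1)]
    for condition, stats in summarize_profile(demo).items():
        print(condition, stats)
```

Reporting the number of samples alongside the mean helps readers judge how much weight a given condition deserves.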

Figure 2

Methods to visualize gene expression profiles, demonstrated using AG (AT4G18960).

(A) ePlant: Plant Viewer eFP. Each organ is colored to indicate gene expression level; yellow and red colors indicate low and high transcripts per million (TPM) values, respectively.

(B) CoNekT-Plants. The samples and gene expression level are shown on the x and y axis, respectively. The bars and points indicate the mean and maximum/minimum expression.

(C) GENEVESTIGATOR developmental atlas. Points indicate mean values, and whiskers indicate standard errors.

(D) GENEVESTIGATOR: anatomy search tool. The different organs, tissues, and cell types have been arranged into logical hierarchies. The gene expression levels are indicated by a heatmap, where white and dark beige colors indicate low and high expression, respectively.

(E) Visualization of scRNA-seq data. Single-cell RNA-seq of A. thaliana cellulose synthase-like D3 (AT3G03050) visualized in ePlant. Each point depicts a single cell, and the low and high expression value of the gene are represented by yellow and red colors, respectively.

Single-cell RNA-seq

Single-cell RNA sequencing (scRNA-seq) allows researchers to discover complex and novel cell populations and track the developmental pathways of distinct cell types (Hwang et al., 2018). Several databases containing scRNA-seq data exist, such as ePlant, Root Cell Atlas (http://wanglab.sippe.ac.cn/rootatlas; Zhang et al., 2019), The Plant scRNA-Seq Browser (https://www.zmbp-resources.uni-tuebingen.de/timmermans/plant-single-cell-browser; Ma et al., 2020), and the EBI Expression Atlas (https://www.ebi.ac.uk/gxa/sc/home; Kapushesky et al., 2010). scRNA-seq datasets can be visualized as t-Distributed Stochastic Neighbor Embedding (tSNE) (Kobak and Berens, 2019) or Uniform Manifold Approximation and Projection (UMAP) (Becht et al., 2018) plots, which enable the identification of subpopulations of transcriptionally similar cells (Shulse et al., 2019). The subpopulations within the data can be identified by clustering (e.g., k value in the EBI Expression Atlas) or from other metadata (inferred cell type, genotype). To demonstrate how single-cell gene expression profiles can reveal gene function, we used CSLD3 (AT3G03050), a gene that functions to initiate root hair development (Hu et al., 2018), as a query in the Single Cell eFP tool of ePlant to generate a tSNE plot (Figure 2E). From the tSNE plot, the expression of CSLD3 is highest in root hair cells relative to other root cell types, which confirms its specialized function in root hair cells. Furthermore, within the different subpopulations of the root hair cells, CSLD3 has the highest expression in root hair cells at the early stage of differentiation, which is in line with the gene’s function in root hair initiation (Figure 2E) (Bernal et al., 2008). This insight could not have been gained from typical expression analysis of roots and highlights the utility of scRNA-seq and the need for this type of data for other tissues, organs, and species.
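
Once cells have been assigned to clusters or inferred cell types, the comparison described above amounts to summarizing a gene's expression per group. The sketch below illustrates this with plain Python; it is not the ePlant implementation, and the cluster labels and expression values are hypothetical.

```python
from collections import defaultdict


def mean_expression_by_cluster(cells):
    """Average one gene's expression within each cluster.

    cells: iterable of (cluster_label, expression_value), one entry per cell.
    Returns {cluster_label: mean expression}.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, value in cells:
        totals[label] += value
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


def top_cluster(cells):
    """Return the cluster label with the highest mean expression."""
    means = mean_expression_by_cluster(cells)
    return max(means, key=means.get)


if __name__ == "__main__":
    # Hypothetical per-cell values for a root-hair-specific gene.
    demo = [
        ("early root hair", 9.0), ("early root hair", 7.5),
        ("late root hair", 3.0), ("cortex", 0.2), ("endodermis", 0.0),
    ]
    print(mean_expression_by_cluster(demo))
    print(top_cluster(demo))
```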

Studying gene function and evolution with comparative expression profile analyses

Expression of multiple genes can be compared within and across species, typically as two-dimensional heatmaps in which rows and columns are configured to address a specific question. The comparative aspect of heatmaps can be used to identify functionally equivalent genes across species. For example, the CoNekT-Plants comparative heatmap viewer for the orthogroup containing AG presents genes from multiple species in rows and their expression values in different organs in columns (Figure 3A). As expected, AG shows predominantly flower-specific expression in A. thaliana, whereas STK, AGL1, and SHP2 show seed-specific expression, consistent with their function in seed development (Ehlers et al., 2016; Paolo et al., 2021). Ginkgo biloba contains one ortholog of AG (Gb_16301), which is also specifically expressed in the strobilus, a reproductive organ from the Flower category in CoNekT-Plants. Interestingly, Amborella trichopoda contains two orthologs of AG, of which only AMTR_s00021p00254030 shows flower-specific expression (Figure 3A). Thus, based on the comparative heatmap, AMTR_s00021p00254030, but not AMTR_s00071p00193200, is functionally equivalent to AG from A. thaliana.
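
Comparative heatmaps of this kind are usually row-scaled so that genes with very different absolute expression levels can be compared by pattern; the caption of Figure 3A notes that each row is scaled to a maximum of one. A minimal sketch of that scaling, with missing data kept as missing, could look as follows. The function name, gene labels, and values are hypothetical.

```python
def scale_rows_to_max(matrix):
    """Scale each gene's profile so that its highest value becomes 1.

    matrix: {gene: {organ: value or None}}; None marks missing data.
    Returns a new nested dict with the same keys.
    """
    scaled = {}
    for gene, row in matrix.items():
        observed = [v for v in row.values() if v is not None]
        row_max = max(observed, default=0)
        scaled[gene] = {
            organ: (None if v is None else (v / row_max if row_max > 0 else 0.0))
            for organ, v in row.items()
        }
    return scaled


if __name__ == "__main__":
    # Hypothetical expression values; "seed" is missing for gene_B.
    demo = {
        "gene_A": {"flower": 50.0, "leaf": 5.0, "seed": 10.0},
        "gene_B": {"flower": 2.0, "leaf": 0.5, "seed": None},
    }
    for gene, row in scale_rows_to_max(demo).items():
        print(gene, row)
```

Because scaling removes absolute expression levels, a weakly expressed ortholog can look identical to a strongly expressed one, so the raw values are worth checking before drawing conclusions about functional equivalence.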

Figure 3

Comparative gene expression analyses.

(A) Comparative expression heatmap of AG. Rows correspond to genes, and columns represent organs. Low and high expression values are represented by green and red cells, respectively. Black cells indicate missing data. Each row has been scaled to have one as the highest value.

(B) ePlant Navigator viewer. The cladogram captures phylogenetic relationships among genes, and the bars depict sequence and expression similarities of the genes to the query (AG).

(C) Phylogenetic tree of the AG orthogroup. The different species are represented by color-coded leaves (species without expression data are black). The expression levels in the organs are visualized as a heatmap, where low and high expressions are indicated by yellow and dark blue colors, respectively. The different clades and their expression are indicated by colored boxes. The species are indicated by gene names: Zm (Zea mays), LOC (Oryza sativa), Bradi (Brachypodium distachyon), Zosma (Zostera marina), GS (Vitis vinifera), Sol (Solanum lycopersicum), AM (Amborella trichopoda), MA (Picea abies), and Gb (Ginkgo biloba).

Gene expression information can be combined with phylogenetic trees to better understand gene function and evolution. Navigator Viewer in ePlant shows the phylogenetic tree of the orthogroup of the query gene, as well as sequence similarity percentages and gene expression similarity scores of the orthogroup genes to the query (Figure 3B). In this example, of the two Medicago truncatula genes, MEDTR8G087860 (sequence similarity 67; expression similarity 0.82) is more likely to be functionally equivalent to A. thaliana AG than MEDTR2G017865 (sequence similarity 67; expression similarity 0.66), because the latter has a lower expression similarity score. CoNekT-Plants combines phylogenetic trees with expression heatmaps to place gene expression in the context of speciation and duplication events in an orthogroup (Figure 3C).
The tree reveals a monocot and eudicot flower/female/seeds clade and another monocot and eudicot female/seeds clade, suggesting a duplication and subspecialization of the AG orthogroup in the ancestor of eudicots and monocots. Interestingly, there is only one clade for A. trichopoda, which contains two genes that each have expression profiles reminiscent of the clades found in monocots and eudicots: flower/female and female. However, it is unclear whether the two A. trichopoda genes are expressed in seeds owing to missing seed data (Figure 3C). The gymnosperm G. biloba gene shows expression in strobili (labeled as flowers in CoNekT-Plants) and female organs, but not in seeds. This indicates that the seed-specific expression of the AG orthogroup evolved in the ancestor of flowering plants, perhaps even in the ancestor of Amborellales. At the same time, the ancestral gene has probably been duplicated and sub-functionalized to perform diverse functions in flower and seed development.

Predicting gene function with co-expression analysis

Co-expression analysis identifies groups of genes with similar expression profiles. Co-expression is based upon the principle of guilt by association, which states that genes involved in similar biological processes should have similar expression profiles across different organs, tissues, cells, and (a)biotic and genetic perturbations (Oliver, 2000). As such, co-expression analyses have been used to great effect to (1) further the understanding of known biochemical pathways (Lau and Sattely, 2015; Caputi et al., 2018), (2) elucidate the role of genes in biological processes (Brown et al., 2005), and (3) predict the functions of unknown genes (Gao et al., 2018; He et al., 2020). Valuable insights into gene function can already be uncovered by identifying genes that are co-expressed with a particular gene of interest (GOI). Almost all databases featuring co-expression data allow users to retrieve co-expressed genes via a single-gene query. In ATTED-II (Obayashi et al., 2009) and CoNekT-Plants, one can access co-expression information on GOIs via gene pages, which contain information dedicated to particular genes. In ATTED-II, webpages for co-expression gene lists (https://atted.jp/gene_coexpression/?gene_id=827631&sp=ath; Supplemental Figure 1A) can be retrieved by clicking the "Coexpressed gene list for " link on the gene page. On this page, users can control how the list is displayed, adjusting the amount of descriptive information (e.g., full gene names, Kyoto Encyclopedia of Genes and Genomes [KEGG] annotations) and selecting options for ranking co-expressed genes based on different transcriptome datasets. The top 10 genes for AG include SEP1/2/3 and AP3, which are also involved in flower development (Hugouvieux et al., 2018; Jetha et al., 2014; Parenicová et al., 2003).
By contrast, CoNekT-Plants allows users to download the co-expression gene list as a text file for offline viewing by clicking a button in the Coexpression Networks section (Supplemental Figure 1B, red circle) on the gene page. Expression Angler from BAR (Toufighi et al., 2005) offers a choice of multiple gene expression datasets and an interactive interface for adjusting the r-value threshold (http://bar.utoronto.ca/ExpressionAngler/?agi_id=AT4G18960&match_count=25&active_view=Developmental%20Map). In addition to the typical list of genes, the tool also displays an expression profile heatmap, enabling users to further evaluate the expression similarity (Supplemental Figure 1C).
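
The single-gene query URLs quoted above follow a simple pattern, so they can be assembled programmatically for a list of genes. The sketch below only builds URL strings and performs no network requests. The parameter names are copied from the URLs in the text; their meaning (e.g., that match_count sets the number of returned genes) is inferred from the names and is an assumption, and the sites may change their URL schemes.

```python
from urllib.parse import quote, urlencode


def expression_angler_url(agi_id, match_count=25, view="Developmental Map"):
    """Build an Expression Angler query URL of the form quoted in the text."""
    base = "http://bar.utoronto.ca/ExpressionAngler/"
    params = {"agi_id": agi_id, "match_count": match_count, "active_view": view}
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return base + "?" + urlencode(params, quote_via=quote)


def atted_coexpression_url(gene_id, species="ath"):
    """Build an ATTED-II co-expression list URL of the form quoted in the text."""
    base = "https://atted.jp/gene_coexpression/"
    return base + "?" + urlencode({"gene_id": gene_id, "sp": species})


if __name__ == "__main__":
    print(expression_angler_url("AT4G18960"))
    print(atted_coexpression_url(827631))
```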

Different strategies for generating co-expression data

The databases use different strategies for generating co-expression data. These differences include the selection of the transcriptome dataset, gene expression quantification and normalization, batch correction, sample balancing, and the statistical metrics used to measure co-expression, all of which add to the degrees of freedom with which co-expression analyses can be implemented. Currently, there is a lack of consensus on the best strategy, as different approaches are suited to answering different biological questions (reviewed in Rao and Dixon, 2019; Serin et al., 2016; Usadel et al., 2009), and this translates into poor agreement among the databases. Expression Angler uses Pearson’s correlation coefficient (PCC) to measure the linear correlation between gene pairs across samples. However, PCC can be sensitive to outliers, and false correlations between otherwise uncorrelated genes can be inferred if they have a high expression value in a single sample (Rao and Dixon, 2019; Usadel et al., 2009). To mitigate sensitivity to outliers (Obayashi and Kinoshita, 2009), ATTED-II and CoNekT-Plants rank genes based on PCC scores to generate directional rankings between gene pairs. Because directional rankings within a gene pair are likely to be asymmetrical (e.g., gene B ranks highest within gene A’s list, whereas gene A ranks second within gene B’s list), ATTED-II takes the geometric mean of the two to produce a mutual rank (MR) score. CoNekT-Plants uses the highest reciprocal rank (HRR), which retains the numerically higher (i.e., weaker) of the two ranks within each gene pair to measure co-expression (Mutwil et al., 2010). CoNekT-Plants uses a condition-independent approach in which the transcriptome dataset consists of all available RNA sequencing (RNA-seq) samples that passed quality control. By contrast, Expression Angler utilizes a condition-dependent approach and hosts co-expression data built from different thematic transcriptome datasets.
For example, the Developmental Map dataset consists of carefully selected microarray experiments from A. thaliana samples at different stages of development (Nakabayashi et al., 2005; Schmid et al., 2005), and the Abiotic Stress dataset contains A. thaliana data capturing different abiotic stresses (Kilian et al., 2007). ATTED-II offers the best of both worlds, as it caters to both approaches; it aggregates both RNA-seq and microarray samples to generate a Universal co-expression dataset, and it also features co-expression data from thematic condition-dependent datasets (e.g., Tissue dataset, Hormones dataset). These methodological and statistical differences may produce different results when the same gene is queried in the three databases. As an example of this discrepancy, the lists of the top 50 genes co-expressed with AG according to the three databases (CoNekT-Plants; ATTED-II, Universal dataset; Expression Angler, Developmental Map dataset) share only two genes (data not shown). These two genes are SEP1 (AT5G15800) and SEP2 (AT3G02310). This relatively poor agreement between co-expression networks has been observed before (Jaccard index of ∼0.1 between different versions of A. thaliana networks; Obayashi et al., 2018). Therefore, although co-expression clearly works, different databases are likely to return co-expression lists and networks that differ to some degree. Consequently, we advise comparing results from multiple databases.
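
The metrics discussed in this section can be sketched in a few lines. The example below is illustrative only and is not the code of any of the databases: it computes PCC between expression vectors, derives directional ranks from the PCC-sorted list of each gene, and combines the two directional ranks into MR (geometric mean) and HRR (the numerically higher rank). It also includes the Jaccard index for comparing two gene lists. Ranks start at 1 here, which is an assumption; the function names and expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def directional_ranks(expr):
    """expr: {gene: [values]}. ranks[a][b] is b's rank in a's list (1 = best)."""
    ranks = {}
    for a in expr:
        others = sorted(
            (g for g in expr if g != a),
            key=lambda g: pearson(expr[a], expr[g]),
            reverse=True,
        )
        ranks[a] = {g: i + 1 for i, g in enumerate(others)}
    return ranks


def mutual_rank(ranks, a, b):
    """Geometric mean of the two directional ranks (lower = stronger)."""
    return sqrt(ranks[a][b] * ranks[b][a])


def highest_reciprocal_rank(ranks, a, b):
    """Numerically higher of the two directional ranks (lower = stronger)."""
    return max(ranks[a][b], ranks[b][a])


def jaccard(list_a, list_b):
    """Jaccard index between two gene lists (shared / union)."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b) if a | b else 0.0


if __name__ == "__main__":
    # Hypothetical expression vectors across five samples.
    expr = {
        "gene_A": [1.0, 2.0, 3.0, 4.0, 5.0],
        "gene_B": [2.0, 4.1, 5.9, 8.2, 9.8],
        "gene_C": [5.0, 4.0, 3.5, 2.0, 1.0],
        "gene_D": [1.0, 1.5, 3.0, 3.5, 5.5],
    }
    ranks = directional_ranks(expr)
    print(mutual_rank(ranks, "gene_A", "gene_B"))
    print(highest_reciprocal_rank(ranks, "gene_A", "gene_B"))
```

For the asymmetric case given in the text, where gene B ranks first in gene A's list and gene A ranks second in gene B's list, MR is the square root of 2 and HRR is 2.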

Visualization and analyses using co-expression networks

A co-expression network typically features genes as nodes and co-expression relationships between genes as edges between nodes (Usadel et al., 2009). The topology of the co-expression network enables powerful analyses based on network theory to extract additional insights. For example, genes with more connections (higher degree) tend to be more essential and produce more severe phenotypes when knocked out (Mutwil et al., 2010). Betweenness centrality (the degree to which nodes are shortest-path connectors) has also been used to identify hub genes that are highly connected in the co-expression network and functionally important to biological processes (van Dam et al., 2018). Owing to the hierarchical organization of gene-to-gene relationships inherent in complex biological systems (e.g., regulator-regulatee, upstream-downstream effector relationships), co-expression networks have a heterogeneous topology, with regions of highly interconnected genes, called clusters or modules (Usadel et al., 2009; Ruprecht et al., 2016). These clusters have been shown to represent genes in important biological processes (van Dam et al., 2018). The visualization of co-expression networks offers a more realistic and informative representation of co-expression relationships than gene lists, as networks display all pairwise relationships (Jupiter and VanBuren, 2008). Visualization of a whole co-expression network is often impractical, as the networks often contain tens of thousands of nodes and several orders of magnitude more edges. ATTED-II and CoNekT-Plants allow visualizations of co-expression neighborhoods (genes closely connected to GOIs) and also map more functional data onto the networks by modifying the styles of nodes and edges. CoNekT-Plants enables users to view neighborhoods via the Coexpression Networks section of gene pages, which shows co-expression relationships among all genes co-expressed with AG (Figure 4A).
In CoNekT-Plants, node color and shape depict orthogroups, and edges represent co-expression relationships (HRR cutoff ≤100). These networks are dynamic and can be adjusted to highlight important parts of the network; the colors and shapes of nodes can be changed to represent orthogroups or Pfam domains, and edge colors can be toggled to indicate HRR scores. The neighborhood networks in ATTED-II are built from multiple neighborhood levels of AG and indicate protein-protein interactions with red dotted lines (Figure 4B).
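
The network concepts above (nodes, edges, degree, and neighborhoods) can be illustrated with a small pure-Python sketch. It builds an undirected network from gene pairs that pass an HRR cutoff of 100, as in the CoNekT-Plants example, counts each gene's degree, and extracts the neighborhood of a query gene. This is an illustration under stated assumptions rather than the databases' implementation; the HRR values and the placeholder genes gene_X and gene_Y are hypothetical.

```python
from collections import defaultdict


def build_network(pairs, hrr_cutoff=100):
    """Build an undirected network from (gene_a, gene_b, hrr) tuples.

    Only pairs with hrr <= hrr_cutoff become edges.
    Returns {gene: set of neighboring genes}.
    """
    adjacency = defaultdict(set)
    for a, b, hrr in pairs:
        if a != b and hrr <= hrr_cutoff:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def degree(adjacency):
    """Number of co-expression partners per gene."""
    return {gene: len(neighbors) for gene, neighbors in adjacency.items()}


def neighborhood(adjacency, gene):
    """Edges among the query gene and its direct neighbors, as sorted pairs."""
    nodes = {gene} | adjacency.get(gene, set())
    edges = {
        tuple(sorted((a, b)))
        for a in nodes
        for b in adjacency.get(a, set())
        if b in nodes
    }
    return sorted(edges)


if __name__ == "__main__":
    # Hypothetical HRR scores, for illustration only.
    pairs = [
        ("AG", "SEP1", 3), ("AG", "SEP2", 5), ("SEP1", "SEP2", 10),
        ("AG", "gene_X", 250), ("SEP2", "gene_Y", 40),
    ]
    network = build_network(pairs)
    print(degree(network))
    print(neighborhood(network, "AG"))
```

Restricting the view to a neighborhood keeps the figure readable, at the cost of hiding genes that are two or more steps away from the query.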

Figure 4

Different visualizations of co-expression networks.

(A) CoNekT-Plants: co-expression cluster 69 that contains AG and 74 other genes and is enriched in genes with GO:0080086 (stamen filament development), GO:0048443 (stamen development), GO:0048441 (petal development), and GO:0009733 (response to auxin). For brevity, only genes involved in flower development (yellow rectangle) and response to auxin (blue squares) are displayed. Genes from the same gene family are labeled with the same node shape and color.

(B) ATTED-II: co-expression network from the local co-expression neighborhood of AG. A thicker edge indicates a stronger correlation between genes. Red dotted lines indicate protein-protein interactions.

(C) CoNekT-Plants: cross-species comparison between cluster 69 in A. thaliana (green rectangle) and cluster 51 in V. vinifera (red rectangle). Blue dotted edges connect genes from the same gene family. Genes from the same gene family are labeled with the same node shape and color.


Identification of biological pathways with co-expression networks

Co-expression networks can be used to reveal known and novel pathways by identifying groups of highly connected genes (clusters/modules). For example, CoNekT-Plants partitions co-expression networks into co-expression modules with the heuristic cluster chiseling algorithm (HCCA) (Mutwil et al., 2010). The functions of these modules can be predicted by over-representation (enrichment) analyses of biological functions captured by, e.g., Gene Ontology (GO; Ashburner et al., 2000) or MapMan terms (Thimm et al., 2004). For example, AG belongs to cluster 69 of A. thaliana (https://conekt.sbs.ntu.edu.sg/cluster/view/453; links to cluster pages can be accessed via gene pages), which is enriched for genes involved in flower developmental processes (e.g., stamen development, petal development) and auxin response (Figure 4A). This agrees with the known function of AG as a transcription factor that specifies floral meristem, carpel, and stamen identity and highlights the link between auxin homeostasis and flower development (Yamaguchi et al., 2017). The identity of other co-expression clusters involved in stamen development across multiple species can be revealed by, e.g., clicking on the "GO:0048443 (Stamen development)" link on the cluster page or using the "Find enriched clusters" tool (https://conekt.sbs.ntu.edu.sg/search/enriched/clusters), with which one can find co-expression clusters involved in any process captured by GO. Thus, co-expression offers a powerful hypothesis-generating tool for predicting gene function and identifying new genes involved in the biological process of interest.
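Over-representation of a functional term in a cluster is commonly assessed with a hypergeometric (one-sided Fisher) test. The sketch below shows that calculation under stated assumptions: the function name `enrichment_p_value` and all counts in the example are hypothetical, and the exact statistics and multiple-testing correction used by CoNekT-Plants are not specified in this review.

```python
from math import comb

def enrichment_p_value(cluster_size, cluster_hits, background_size, background_hits):
    """Probability of seeing at least `cluster_hits` annotated genes in a
    cluster of `cluster_size` genes drawn from `background_size` genes,
    of which `background_hits` carry the annotation (hypergeometric tail)."""
    max_hits = min(cluster_size, background_hits)
    total = comb(background_size, cluster_size)
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, cluster_size - k)
        for k in range(cluster_hits, max_hits + 1)
    )
    return tail / total

# Hypothetical numbers: a 75-gene cluster in which 8 genes carry a GO term
# that annotates 60 of 27 000 genes in the genome.
p_value = enrichment_p_value(
    cluster_size=75, cluster_hits=8, background_size=27000, background_hits=60
)
print(f"p = {p_value:.3g}")
```

In practice, one p value is computed per term and per cluster, so the resulting p values should be corrected for multiple testing before clusters are called enriched.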

Comparative co-expression network analyses

Similar to conserved gene expression profiles across multiple species (Figure 3), regions of co-expression networks can be conserved even across vast evolutionary distances (Mutwil et al., 2011; Ferrari and Mutwil, 2019). Comparative co-expression network analyses enable the transfer of functional knowledge from model to non-model species (Movahedi et al., 2012; Sibout et al., 2017) and the study of biological pathway evolution (Ruprecht et al., 2017; Ferrari et al., 2020). Furthermore, because conserved co-expression relationships tend to be functionally significant (Hansen et al., 2014), a comparative analysis can remove potentially irrelevant co-expression relationships. Using AG, we demonstrate how functional information can be transferred between species. AG is a MADS domain transcription factor, and CoNekT-Plants identified 413 other genes in the same gene family (orthogroup) across 13 different Viridiplantae (https://conekt.sbs.ntu.edu.sg/family/view/12). CoNekT-Plants uses orthogroups to score similarities between clusters, as similar clusters are expected to contain a high number of identical orthogroups. The "Expression context conservation (ECC)" page for AG (https://conekt.sbs.ntu.edu.sg/sequence/view/46111) lists all genes in the AG orthogroup with similar co-expression neighborhoods (https://conekt.sbs.ntu.edu.sg/ecc/graph/46111/2/1). This list can be sorted by the ECC value (Jaccard index orthogroup similarity of cluster pairs) to suggest functionally equivalent genes across species or duplicated gene modules within species (Delli-Ponti et al., 2021; Ruprecht et al., 2016). In the cluster page for cluster 69 of A. thaliana (https://conekt.sbs.ntu.edu.sg/cluster/view/453), the Similar Clusters section allows one to visually compare clusters in a network view. A comparison between cluster 69 from A. thaliana and cluster 51 from Vitis vinifera with a Jaccard index of 0.038 (https://conekt.sbs.ntu.edu.sg/graph_comparison/cluster/453/6929/2) identified genes encoding transcription factors involved in floral development (Cheng et al., 2009; Gross et al., 2018), such as MYB21 (AT3G27810), MYB24 (AT5G40350), MYB57 (AT3G01530), and CRC (AT1G69180) (Figure 4C). Because the grape cluster contains a similar set of genes to the A. thaliana cluster, it is likely that the genes in the grape cluster are also involved in flower development.
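The Jaccard index used to score cluster similarity is the number of orthogroups shared by two clusters divided by the number of orthogroups found in either cluster. A minimal sketch follows; the orthogroup identifiers and the function name `jaccard_index` are hypothetical.

```python
def jaccard_index(orthogroups_a, orthogroups_b):
    """Shared orthogroups divided by all orthogroups present in either cluster."""
    a, b = set(orthogroups_a), set(orthogroups_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical orthogroup content of two clusters from different species.
cluster_species_1 = {"OG_01", "OG_02", "OG_03", "OG_04"}
cluster_species_2 = {"OG_03", "OG_04", "OG_05"}

# Two shared orthogroups out of five distinct ones.
print(jaccard_index(cluster_species_1, cluster_species_2))
```

Because the denominator grows with cluster size, large clusters that share only a handful of orthogroups yield small Jaccard values, so scores are best interpreted relative to other cluster pairs rather than on an absolute scale.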

Predicting gene function with differential gene expression analysis

A classical treatment/control experiment can reveal hundreds of differentially expressed genes (DEGs) and can identify genes important for the plant response to the treatment. For example, genes that are upregulated during cold treatment may be necessary for surviving freezing temperatures (To et al., 2011, p. 6). Thus, identifying these DEGs can help to rapidly dissect the mechanism of freezing tolerance. Expression Atlas (Supplemental Figure 2A) can identify experimental conditions in which a GOI is significantly (indicated by p value) up- or downregulated (indicated by log-fold change). AG is significantly upregulated in the comparison of the clf28 mutant versus wild type, consistent with the fact that CLF28 is necessary for AG repression (Krizek et al., 2006). Users can also click on clf28 mutant versus wild type in Shoot to view the experimental information and identify all DEGs specific to this comparison. Similarly, AtCAST (Sasaki et al., 2011) revealed that AG is downregulated (negative signal ratio) in its mutant background (ag-12) and in the lfy-12 mutant, which supports the activation of AG by LFY (Supplemental Figure 2B) (Busch et al., 1999). In addition, Expression Atlas allows users to specify a set of genes (e.g., members of a biological pathway), which are then matched to the precalculated DEGs from all experiments in the database (Supplemental Figure 2C). Thus, users can rapidly identify experimental conditions that cause the largest expression changes in their genes of interest. The database ranks the matches by the observed/expected ratio, i.e., the ratio between the observed intersection of the user gene set with the DEGs in an experiment and the intersection expected by chance. Although the above tools use predefined control/treatment experimental groups, GENEVESTIGATOR allows users to define the control-group and treatment-group experiments and reveals the corresponding upregulated and downregulated gene sets (Supplemental Figure 2D).
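Two quantities from this section, the log-fold change and the observed/expected ratio, can be sketched as follows. The function names, the pseudocount, the gene identifiers, and the background size are assumptions for illustration. In particular, the expected overlap is computed here as (size of user set × number of DEGs) / number of background genes, which is one plausible reading of "expected by chance" and not necessarily the formula used by Expression Atlas.

```python
from math import log2

def log2_fold_change(treatment_mean, control_mean, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount avoids division by zero."""
    return log2((treatment_mean + pseudocount) / (control_mean + pseudocount))

def observed_expected_ratio(user_genes, degs, n_background_genes):
    """Observed overlap between a user gene set and the DEGs of an experiment,
    divided by the overlap expected if the user genes were drawn at random."""
    user_genes, degs = set(user_genes), set(degs)
    observed = len(user_genes & degs)
    expected = len(user_genes) * len(degs) / n_background_genes
    return observed / expected if expected else 0.0

# Hypothetical experiment: 100 DEGs among 1000 measured genes.
degs = {f"g{i}" for i in range(100)}
user_set = {"g1", "g2", "g500", "g600"}

print(log2_fold_change(treatment_mean=7.0, control_mean=1.0))
print(observed_expected_ratio(user_set, degs, n_background_genes=1000))
```

A fold change alone does not establish significance. Calling DEGs additionally requires replicate samples and a statistical test, which is why the databases report a p value alongside the log-fold change.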

Identifying housekeeping reference genes

Genes expressed at relatively constant levels are often involved in maintaining basal cellular functions that are essential for the existence of a cell (Gutierrez et al., 2008; Lin et al., 2014). These genes, termed reference genes or housekeeping genes, are often used as normalization controls in transcript quantification assays such as qRT-PCR (e.g., GAPDH or ACTB). However, an over-reliance on an all-purpose group of commonly used housekeeping genes for normalization may not be ideal because their expression, although stable across some conditions, is not universally stable across all conditions (Kozera and Rapacz, 2013; Joseph et al., 2018). Thus, it is vital to identify housekeeping genes that are stably expressed in the specific experimental context. To that end, the RefGenes tool of GENEVESTIGATOR enables the identification of context-specific housekeeping genes with the highest expression stability across a user-defined set of conditions (Supplemental Figure 3A). Users can also filter candidate reference genes based on their expression level, which is essential when choosing reference genes for qRT-PCR (Kozera and Rapacz, 2013). Similarly, Rice Expression Database (Xia et al., 2017) allows users to extract housekeeping genes using τ values, which range from 0 to 1 (Supplemental Figure 3B).
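The idea of ranking candidate reference genes by expression stability, after filtering on expression level, can be sketched with the coefficient of variation (standard deviation divided by the mean). This stability measure is an assumption made for illustration; the review does not state which statistic RefGenes uses. The gene names, expression values, and threshold are hypothetical.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation relative to the mean; lower means more stable."""
    avg = mean(values)
    return pstdev(values) / avg if avg > 0 else float("inf")

def rank_reference_candidates(profiles, min_mean_expression=10.0):
    """Return sufficiently expressed genes ordered from most to least stable."""
    candidates = {
        gene: coefficient_of_variation(values)
        for gene, values in profiles.items()
        if mean(values) >= min_mean_expression
    }
    return sorted(candidates, key=candidates.get)

# Hypothetical expression values across a user-defined set of conditions.
profiles = {
    "stable_gene": [100.0, 105.0, 98.0, 102.0],
    "variable_gene": [5.0, 300.0, 40.0, 900.0],
    "low_gene": [1.0, 1.0, 2.0, 1.0],  # stable but too weakly expressed
}

print(rank_reference_candidates(profiles))
```

Restricting `profiles` to the conditions relevant to a planned experiment is what makes the resulting candidates context-specific.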

Predicting gene function by specificity analyses

On the opposite end of the spectrum from housekeeping genes, specific (selective) genes are expressed exclusively in a particular organ, tissue, developmental stage, or treatment. Because organ-specific genes are likely to be important for the organ’s function (e.g., flower-specific AG is important for flower development; Figure 2) (Julca et al., 2021), extracting and identifying specifically expressed genes is a powerful method for predicting gene function. CoNekT-Plants uses the specificity measure (SPM; ranges between zero and one, where one indicates that the gene is exclusively expressed in the tissue), Tau (high values indicate that a profile is specific to a tissue), and entropy (indicates how much a profile fluctuates across all tissues, and genes with very specific or very stable expression have low entropy) to identify specific genes. Users can query a certain condition or tissue type in the Specific Profiles tool, which will provide a gene list with information about the SPM score, entropy score, and Tau score of specific genes. Users can also compare two lists of specifically expressed genes within or across species in the Compare Specificities tool (https://conekt.sbs.ntu.edu.sg/specificity_comparison/). For example, by selecting Arabidopsis thaliana/Tissue Specificity/Flower/SPM 0.85 and Solanum lycopersicum/Tissue Specificity/Flower/SPM 0.85 and clicking Compare Specificity, one obtains 107 orthogroups specifically expressed in the flowers of the two species (Supplemental Figure 4A). The list of the orthogroups contains A. thaliana AG, its four orthologs from tomato, and, e.g., MYB57, which is essential for stamen development (Cheng et al., 2009, p. 21). Similar to CoNekT-Plants, GENEVESTIGATOR contains a Gene Search tool that can help to identify specific genes. This tool offers a greater degree of freedom in interrogating the database for specific genes, as users can define a set of experiments as background and choose a subset of experiments in which to identify specifically expressed genes (Supplemental Figure 4B).
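The three specificity measures can be sketched as follows. The formulas are assumptions consistent with the descriptions above, not a transcription of the CoNekT-Plants code: SPM is taken as the expression in one tissue divided by the length of the whole expression vector, Tau follows its widely used definition (the mean of 1 − x/x_max over the remaining tissues), and entropy is computed on a histogram of expression values, so that both very specific and very stable profiles score low. Tissue names and values are hypothetical.

```python
from math import log2, sqrt

def spm(profile, tissue):
    """Expression in `tissue` relative to the length of the profile vector."""
    norm = sqrt(sum(v * v for v in profile.values()))
    return profile[tissue] / norm if norm else 0.0

def tau(profile):
    """0 for uniform expression, 1 for expression in a single tissue."""
    values = list(profile.values())
    peak = max(values)
    if peak == 0 or len(values) < 2:
        return 0.0
    return sum(1 - v / peak for v in values) / (len(values) - 1)

def binned_entropy(profile, n_bins=20):
    """Shannon entropy of the histogram of expression values."""
    values = list(profile.values())
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # perfectly stable profile: all values share one bin
    counts = [0] * n_bins
    for v in values:
        index = min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[index] += 1
    total = len(values)
    return -sum(c / total * log2(c / total) for c in counts if c)

flower_specific = {"flower": 50.0, "leaf": 0.0, "root": 0.0, "stem": 0.0}
stable = {"flower": 20.0, "leaf": 20.0, "root": 20.0, "stem": 20.0}
fluctuating = {"flower": 10.0, "leaf": 40.0, "root": 70.0, "stem": 100.0}

for name, profile in [("specific", flower_specific), ("stable", stable),
                      ("fluctuating", fluctuating)]:
    print(name, round(spm(profile, "flower"), 2), round(tau(profile), 2),
          round(binned_entropy(profile), 2))
```

Under these definitions, a threshold such as SPM 0.85 selects genes whose expression is concentrated in the chosen tissue, while the stable profile receives a Tau of zero.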

Experiment correlation tools

When the gene expression profiles of two independent experiments are highly correlated, we can infer that the treatments, genotypes, organs, or other experimental variables are related. Thus, the genome-wide comparison of gene expression patterns of experiments can uncover novel biological relationships. AtCAST (Sasaki et al., 2011) collects publicly available A. thaliana gene expression data and provides a list of experiments that are positively and negatively correlated to the query experiment. To mitigate the noise between gene expression data from different studies, the tool only compares the expression profiles of genes with significant expression changes (termed "module" according to the authors). Here, an RNA-seq experiment on the AG mutant (ag-12) is strongly correlated with other samples from LFY, UFO, and AP3 mutants (Supplemental Figure 5A), consistent with the shared functions of these genes in flower development (Ng and Yanofsky, 2001). Users can also upload their RNA-seq or microarray data to screen the compendium for similar experiments. The system also provides a graph-based network to visualize the relationship of experiments in the table (Supplemental Figure 5B). Furthermore, users can click the More Info button behind each experiment in the table to view its module genes (Supplemental Figure 5C).
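The experiment-correlation approach can be sketched as a Pearson correlation computed only over genes that change strongly in the query experiment. The log-ratio threshold used to define the module, the gene identifiers, the expression values, and the function names are all hypothetical; the review states only that AtCAST restricts the comparison to genes with significant expression changes.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return covariance / sqrt(var_x * var_y)

def experiment_correlation(query, other, min_abs_log_ratio=1.0):
    """Correlate two experiments over the query's strongly changing genes."""
    module = [
        gene for gene, ratio in query.items()
        if abs(ratio) >= min_abs_log_ratio and gene in other
    ]
    if len(module) < 2:
        return 0.0
    return pearson([query[g] for g in module], [other[g] for g in module])

# Hypothetical log ratios (mutant versus wild type) for three experiments.
query = {"g1": 2.0, "g2": -1.5, "g3": 0.1, "g4": 3.0}
similar = {"g1": 1.8, "g2": -1.2, "g3": -2.0, "g4": 2.5}
opposite = {gene: -ratio for gene, ratio in query.items()}

print(round(experiment_correlation(query, similar), 3))
print(round(experiment_correlation(query, opposite), 3))
```

Here `g3` barely changes in the query experiment and is therefore excluded, so its discordant value in the second experiment does not lower the correlation.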

Concluding remarks

In order to uncover the functions of plant genes, a multitude of analytical methods have been developed to interrogate the rapidly growing publicly available gene expression data. To make the outcome of these methods readily available to non-bioinformaticians, the bioinformatics community has produced various plant gene expression databases. In this review, we have discussed the principles behind the different gene expression analyses and demonstrated the databases that offer these analyses. Thus, the ubiquity and usefulness of gene expression data and the plethora of online resources provide powerful gene function prediction tools that plant biologists should incorporate into their research.

Funding

This review was supported by a tier-two grant (MOE2018-T2-2-053) and a Singapore Food Agency grant (SFS_RND_SUFP_001_05).

81 in total

1. Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution.

Authors: Xionglei He; Jianzhi Zhang
Journal: Genetics Date: 2005-01-16 Impact factor: 4.562

2. The protein encoded by the Arabidopsis homeotic gene agamous resembles transcription factors.

Authors: M F Yanofsky; H Ma; J L Bowman; G N Drews; K A Feldmann; E M Meyerowitz
Journal: Nature Date: 1990-07-05 Impact factor: 49.962

Review 3. Co-expression networks for plant biology: why and how.

Authors: Xiaolan Rao; Richard A Dixon
Journal: Acta Biochim Biophys Sin (Shanghai) Date: 2019-09-06 Impact factor: 3.848

4. ePlant: Visualizing and Exploring Multiple Levels of Data for Hypothesis Generation in Plant Biology.

Authors: Jamie Waese; Jim Fan; Asher Pasha; Hans Yu; Geoffrey Fucile; Ruian Shi; Matthew Cumming; Lawrence A Kelley; Michael J Sternberg; Vivek Krishnakumar; Erik Ferlanti; Jason Miller; Chris Town; Wolfgang Stuerzlinger; Nicholas J Provart
Journal: Plant Cell Date: 2017-08-14 Impact factor: 11.277

Review 5. Beyond Genomics: Studying Evolution with Gene Coexpression Networks.

Authors: Colin Ruprecht; Neha Vaid; Sebastian Proost; Staffan Persson; Marek Mutwil
Journal: Trends Plant Sci Date: 2017-01-23 Impact factor: 18.313

6. AtCAST, a tool for exploring gene expression similarities among DNA microarray experiments using networks.

Authors: Eriko Sasaki; Chitose Takahashi; Tadao Asami; Yukihisa Shimada
Journal: Plant Cell Physiol Date: 2010-11-26 Impact factor: 4.927

Review 7. Application of genetics and biotechnology for improving medicinal plants.

Authors: Mohsen Niazian
Journal: Planta Date: 2019-02-04 Impact factor: 4.116

8. CRABS CLAW Acts as a Bifunctional Transcription Factor in Flower Development.

Authors: Thomas Gross; Suvi Broholm; Annette Becker
Journal: Front Plant Sci Date: 2018-06-20 Impact factor: 5.753

9. ATTED-II in 2018: A Plant Coexpression Database Based on Investigation of the Statistical Property of the Mutual Rank Index.

Authors: Takeshi Obayashi; Yuichi Aoki; Shu Tadaka; Yuki Kagaya; Kengo Kinoshita
Journal: Plant Cell Physiol Date: 2018-01-01 Impact factor: 4.927

10. The art of using t-SNE for single-cell transcriptomics.

Authors: Dmitry Kobak; Philipp Berens
Journal: Nat Commun Date: 2019-11-28 Impact factor: 14.919
