Sipko van Dam, Urmo Võsa, Adriaan van der Graaf, Lude Franke, João Pedro de Magalhães.
Abstract
Gene co-expression networks can be used to associate genes of unknown function with biological processes, to prioritize candidate disease genes or to discern transcriptional regulatory programmes. With recent advances in transcriptomics and next-generation sequencing, co-expression networks constructed from RNA sequencing data also enable the inference of functions and disease associations for non-coding genes and splice variants. Although gene co-expression networks typically do not provide information about causality, emerging methods for differential co-expression analysis are enabling the identification of regulatory genes underlying various phenotypes. Here, we introduce and guide researchers through a (differential) co-expression analysis. We provide an overview of methods and tools used to create and analyse co-expression networks constructed from gene expression data, and we explain how these can be used to identify genes with a regulatory role in disease. Furthermore, we discuss the integration of other data types with co-expression networks and offer future perspectives of co-expression analysis.
Year: 2018 PMID: 28077403 PMCID: PMC6054162 DOI: 10.1093/bib/bbw139
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1. Example of a co-expression network analysis. First, pairwise correlation is determined for each possible gene pair in the expression data. These pairwise correlations can then be represented as a network. Modules within these networks are defined using clustering analysis. The network and modules can be interrogated to identify regulators, functional enrichment and hub genes. Differential co-expression analysis can be used to identify modules that behave differently under different conditions. Potential disease genes can be identified using a guilt-by-association (GBA) approach that highlights genes that are co-expressed with multiple disease genes.
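The first steps of the workflow in Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene names and expression values are made up, the correlation cut-off of 0.8 is arbitrary, and modules are taken as connected components of the thresholded network, which is a deliberate simplification of the clustering analysis mentioned in the caption.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def build_network(expr, threshold=0.8):
    """Link every gene pair whose absolute correlation reaches the threshold."""
    net = {g: set() for g in expr}
    for g1, g2 in combinations(expr, 2):
        if abs(pearson(expr[g1], expr[g2])) >= threshold:
            net[g1].add(g2)
            net[g2].add(g1)
    return net

def modules(net):
    """Modules as connected components (a stand-in for clustering)."""
    seen, found = set(), []
    for start in net:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            g = stack.pop()
            if g in comp:
                continue
            comp.add(g)
            stack.extend(net[g] - comp)
        seen |= comp
        found.append(comp)
    return found

def hub(net, module):
    """Gene with the most connections inside the module (ties: first by name)."""
    return max(sorted(module), key=lambda g: len(net[g] & module))

# Hypothetical expression of five genes across six samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "geneC": [1.2, 1.9, 3.3, 3.9, 5.1, 5.8],
    "geneD": [5.0, 1.0, 4.0, 2.0, 6.0, 1.5],
    "geneE": [4.8, 1.3, 4.1, 2.2, 5.7, 1.4],
}
net = build_network(expr)
mods = modules(net)
for m in mods:
    print(sorted(m), "hub:", hub(net, m))
```

With real data, a hard threshold and connected components would usually be replaced by a weighted network and hierarchical clustering, as done by the module detection tools listed in the table.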
Figure 2. Hypothetical network explaining inter- and intra-modular hubs and network centrality. The inter-modular hub has a high network centrality, as it is required for the largest number of shortest paths between all possible node pairs. The red line indicates an example of a shortest path through the network between a pair of nodes. Intra-modular hubs (marked with orange) are central to individual modules and usually have high biological relevance.
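The notion of centrality used in Figure 2, the number of shortest paths that pass through a node, corresponds to betweenness centrality. The sketch below computes it with Brandes' algorithm for an unweighted, undirected network. The toy graph (two triangular modules joined by a node named `bridge`) is an assumption made for illustration and is not the network drawn in the figure.

```python
from collections import deque

def betweenness(graph):
    """Betweenness centrality (Brandes' algorithm), unweighted and undirected."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies, farthest nodes first.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was visited from both of its endpoints.
    return {v: c / 2 for v, c in score.items()}

# Two modules (a1-a3 and b1-b3) connected through a single inter-modular node.
graph = {
    "a1": {"a2", "a3", "bridge"},
    "a2": {"a1", "a3"},
    "a3": {"a1", "a2"},
    "bridge": {"a1", "b1"},
    "b1": {"b2", "b3", "bridge"},
    "b2": {"b1", "b3"},
    "b3": {"b1", "b2"},
}
cent = betweenness(graph)
print(max(cent, key=cent.get))  # node lying on the most shortest paths
```

In this toy graph all nine shortest paths between the two modules run through `bridge`, so it scores highest, while `a1` and `b1` play the role of intra-modular hubs with the most connections inside their own module.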
Methods and tools for RNA-seq-based co-expression network analysis
| Tool/method | Description, strengths (+) and limitations (−) |
|---|---|
| Quality control | |
| FastQC | • A tool that uses .fastq, .bam or .sam files to identify and highlight potential issues in the data, such as low base quality scores, low sequence quality and GC content biases. |
| | + Can be used either with or without user interface. |
| | − Uses only the first 200 000 sequences in the file. |
| RSeQC | + A tool with a wider range of quality control measures than FastQC. |
| | + Can also be used on mapped data to obtain information on metrics such as the prevalence of splicing events. |
| QoRTs | + This is a similar tool to RSeQC but incorporates more quality control metrics. |
| Read Mappers | |
| Bowtie/Tophat/Tophat2 | • The first widely used mapping tool. |
| | + Detects splice variants. |
| | − Currently much slower than most other mappers and requires a relatively large amount of memory. |
| STAR | • A widely used tool to align reads to a genome. |
| | + Maps ∼50 times faster than Tophat and Tophat2. |
| | + Commonly used tool to detect novel splice variants. |
| | − Uses a large amount of memory (>20 GB for mapping to the human genome). |
| HISAT | • A widely used tool to align reads to a genome at a faster rate than STAR with comparable accuracy. |
| | + HISAT2 is expected to be the core of the next version of Tophat (Tophat3). |
| | + Detects novel splice variants. |
| | + The newer HISAT2 version aligns to genotype variants, likely achieving higher accuracy. |
| | + Uses less memory than STAR (<8 GB for mapping to the human genome using default settings). |
| BWA | • A commonly used aligner for species in which splicing does not occur. |
| | − Does not detect splice variants. |
| Kallisto | • A tool that uses a pseudoalignment strategy to assign expression values to transcripts/genes to achieve optimal speed. |
| | • Comparable accuracy to other tools using real alignment strategies. |
| | • Reports reads/expression per gene instead of read alignment coordinates (which are commonly used to acquire the expression per gene). |
| | + Uses little memory and can be run on a regular desktop computer. |
| | − Does not identify novel splice variants. |
| Salmon | • Another pseudoalignment tool. Performance comparable with Kallisto. |
| | • Reports reads/expression per gene instead of read alignment coordinates (which are commonly used to acquire the expression per gene). |
| | − Does not identify novel splice variants. |
| Read counting tools | |
| HTseq | • A tool that assigns expression values to genes based on reads that have been aligned with, e.g. STAR or HISAT. |
| | + Well documented and supported. |
| FeatureCounts | + A tool that is similar to HTseq but much faster. Results differ slightly because the two tools assign reads to features in slightly different ways. |
| SpliceNet | • A tool that divides the reads mapping to an exon shared by two isoforms in proportion to the total expression of each of the two whole isoforms. |
| | + Estimates expression more accurately when multiple genes/transcripts partly share the same genome regions. |
| Normalization | |
| FPKM/RPKM | • Widely used normalization methods that correct for the total number of reads in a sample while accounting for gene length. |
| | − TMM has been suggested as a better alternative. |
| TPM | • A method similar to FPKM, but normalizes the total expression to 1 million, i.e. the summed expression of TPM-normalized samples is always 1 million. |
| TMM | • Similar to FPKM/RPKM but puts expression measures on a common scale across different samples. |
| RAIDA | • A method that uses ratios between counts of genes in each sample for normalization. |
| | + Avoids problems caused by differential transcript abundance between samples (resulting from differential expression of highly abundant gene transcripts). |
| DEseq2 | • A normalization method that adjusts the expression values of each gene in a sample by a set factor. This factor is determined by taking the median gene expression in a sample after dividing the expression of each gene by the geometric mean of the given gene across all samples. This differs from the normalization implemented in the DEseq2 differential expression analysis. |
| | • Implemented in the DEseq2 R package. |
| Correction for batch effects | |
| Limma-removeBatchEffect | • A method which uses linear models to correct for batch effects. |
| Svaseq | • This method estimates biases based on genes that have no phenotypic expression effects, which are then used for correction of the data. |
| | • Specifically designed for RNA-seq data. |
| Combat | • A method that is robust to outliers and also effective at batch effect correction in small sample sizes (<25). |
| Co-expression module detection | |
| WGCNA | • A tool that constructs a co-expression network using Pearson correlation (default) or a custom distance measure. |
| | • Uses hierarchical clustering and has various ‘tree cutting’ options to identify modules. |
| | + Most widely used tool, well supported and documented. |
| DiffCoEx | • A method that uses a similar approach to WGCNA to identify and group differentially co-expressed genes instead of identifying co-expressed modules. |
| | • Identifies modules of genes that change co-expression partners in the same way between sample groups. |
| DICER | • A method that identifies modules that correlate differently between sample groups, e.g. modules that form one large interconnected module in one group compared with several smaller modules in another group. |
| CoXpress | • A tool that identifies co-expression modules in each sample group and tests whether the genes within these modules are also co-expressed in other groups. |
| DINGO | • DINGO is a more recent tool that groups genes based on how differently they behave in a particular subset of samples (representing, e.g. a particular condition) from the baseline co-expression determined from all samples. |
| GSCNA | • A tool that tests whether a predefined gene set is differentially co-expressed between two sample groups. |
| GSVD | • A method that identifies ‘genelets’, which can be interpreted as modules representing partial co-expression signals from multiple genes. These signals are then compared between two groups to identify genelets unique to samples and genelets that are shared between the two groups. |
| HO-GSVD | • A tool similar to GSVD, but that can be used across multiple sample groups rather than only two. |
| Biclustering | • A group of methods that identify modules that are unique to a subpopulation of samples without the need for prior grouping of samples. |
| Functional enrichment | |
| DAVID | • A widely used tool with an online web interface. Users supply a list of genes and select the annotation categories from various sources to identify enrichment. |
| PANTHER | • A tool that uses a comprehensive protein library combined with human curated pathways and evolutionary ontology. |
| | • If a gene is not in the library, it is classified based on its protein sequence conservation and by finding a related gene. |
| g:Profiler | • A tool that performs enrichment analyses for gene ontologies, KEGG pathways, protein–protein interactions, TF and miRNA binding sites. |
| | + Also available as an R package. |
| ClusterProfiler | • An R package for overrepresentation and gene set enrichment analyses for several curated gene sets. |
| | + Allows users to compare the results of analyses performed on several gene sets. |
| Enrichr | • An intuitive web tool for performing gene overrepresentation analyses using a comprehensive set of functional annotations. |
| ToppGene | • An intuitive tool that determines enrichment of different categories such as GO terms, chromosomal locations and disease associations. |
| | + Also has other functions, such as candidate gene prioritization, based on network structures. |
| Regulatory network inference | |
| ARACNE | • A tool that removes indirect connections between genes (i.e. partners of a gene that have a stronger correlation with each other than with the gene itself), leaving only those connections that are expected to be regulatory. |
| | + Creates directional networks. |
| Genie3 | • A tool that incorporates TF information to construct a regulatory network by determining the TF expression pattern that best explains the expression of each of their target genes. |
| | + Creates directional networks. |
| | − Requires TF information. |
| CoRegNet | • A tool that identifies co-operative regulators of genes from different data types. |
| cMonkey | • Calculates joint bicluster membership probability from different data types by identifying groups of genes that group together in multiple data types. |
| Visualization | |
| Cytoscape | • A widely used tool for the visualization of networks. |
| | + Has many plug-ins available for specific analyses. |
| BioLayout | • Similar to Cytoscape but less widely used. |
| | + Can load and visualize much larger networks than Cytoscape. |
| Co-expression databases ᵃ | |
| COXPRESdb | • A web resource incorporating 12 co-expression networks for different species created from ∼157 000 microarrays and 10 000 RNA-seq samples. Has a focus on protein-coding RNAs. |
| GeneFriends | • Human and mouse gene and transcript co-expression networks. |
| | • Networks constructed from ∼4000 RNA-seq samples each. |
| | + Includes a number of non-coding RNAs (∼10 000 for mouse and ∼25 000 for human). |
| GeneMANIA | • Also includes physical and genetic interaction, co-localization, pathway and shared protein domain information data sets. |
| | + Networks for nine species. |
| GENEVESTIGATOR | • A database constructed using ∼145 000 samples. |
| | + Curated database. |
| | + Networks for 18 species. |
| | + Multiple data types. |
| GIANT | • Tissue-specific interaction network database. |
| | • Includes 987 data sets encompassing 38 000 conditions describing 144 tissue types. |
| | + Integrates physical interaction, co-expression, miRNA binding motif and TF binding site data. |
This is a non-comprehensive list of available tools and methods.
ᵃ These databases can be queried for a gene or multiple genes of interest to identify commonly co-expressed genes across the samples the database was created from.
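Two of the normalization entries in the table can be made concrete with a short sketch: TPM, whose values sum to 1 million per sample, and the median-of-ratios size factor described in the DEseq2 row. The function names, the toy counts and the choice to skip genes with a zero count in any sample are assumptions made for this illustration; the sketch is not the code of any package listed above.

```python
from math import exp, log
from statistics import median

def tpm(counts, lengths_kb):
    """Transcripts per million for one sample (gene -> read count)."""
    # Correct each count for gene length, then rescale so the sum is 1e6.
    rates = {g: counts[g] / lengths_kb[g] for g in counts}
    total = sum(rates.values())
    return {g: r / total * 1e6 for g, r in rates.items()}

def size_factors(samples):
    """Median-of-ratios size factor per sample.

    samples maps sample name -> {gene: raw count}.
    """
    names = list(samples)
    # Assumption: genes with a zero count in any sample are left out,
    # because their geometric mean would be zero.
    genes = [g for g in samples[names[0]]
             if all(samples[s][g] > 0 for s in names)]
    geo_mean = {
        g: exp(sum(log(samples[s][g]) for s in names) / len(names))
        for g in genes
    }
    return {s: median(samples[s][g] / geo_mean[g] for g in genes)
            for s in names}

# Hypothetical data: sample s2 was sequenced twice as deeply as s1.
samples = {
    "s1": {"g1": 100, "g2": 200, "g3": 300},
    "s2": {"g1": 200, "g2": 400, "g3": 600},
}
sf = size_factors(samples)
normalized = {s: {g: c / sf[s] for g, c in samples[s].items()} for s in samples}

one_sample = tpm({"g1": 10, "g2": 20}, {"g1": 1.0, "g2": 4.0})
```

In the toy data the size factor of `s2` is twice that of `s1`, so dividing the raw counts by the size factors puts both samples on the same scale.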
Figure 3. Changes in gene co-expression patterns that can occur between samples. Differential co-expression can occur as the presence of a module in only one of the sample groups (A), as differences in the structure of the module (B) or as differences in the correlation strength between members of the modules (C). Additionally, differential co-expression can be detected if one larger interconnected module splits into several smaller ones (D) or if a group of genes changes its correlation partners [‘gene hopping’ (E)]. If sample groups are not defined before the differential co-expression analysis, or are unknown, biclustering methods can identify modules unique to a subpopulation of samples by simultaneously classifying the samples into groups in which these modules exist (F).
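The simplest case in Figure 3, a change in correlation strength between two genes across sample groups (C), can be tested for a single gene pair with Fisher's z-transformation. The sketch below is a generic statistical test chosen for illustration, not the procedure used by any of the differential co-expression tools in the table; the correlations and group sizes are invented.

```python
from math import atanh, erfc, sqrt

def diff_correlation(r1, n1, r2, n2):
    """Fisher z-test for a difference between two Pearson correlations.

    r1, r2: correlation of the same gene pair in group 1 and group 2.
    n1, n2: number of samples per group (each must exceed 3).
    Returns the z statistic and a two-sided p-value.
    Note: atanh is undefined for correlations of exactly 1 or -1.
    """
    z = (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    # Two-sided tail probability of the standard normal distribution.
    p = erfc(abs(z) / sqrt(2))
    return z, p

# Hypothetical gene pair: tightly co-expressed in cases, barely in controls.
z, p = diff_correlation(0.9, 30, 0.1, 30)
print(f"z = {z:.2f}, p = {p:.2g}")
```

In a genome-wide analysis the test would be repeated for every gene pair, so the resulting p-values need correction for multiple testing.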
Figure 4. Strategies for integrating multi-omics data with co-expression analyses. Networks are more informative if they are constructed using expression data specific to the tissue of interest. Genomic variation can be mapped to a co-expression network either by linking suggestive GWAS hits to the genes in the network or by first identifying genetic variants with an effect on gene expression levels (cis- and trans-eQTLs) and then mapping those to the co-expression network. Additional data layers may include TFBSs (based on binding motifs or ChIP-seq/ChIP-chip experiments), miRNA target binding sites (based on in silico predictions or experimental techniques) and established protein–protein interactions. A co-expression network can be used to identify modules, hub genes and for predicting the function of unknown trait-associated genes. Identified modules can be analysed by enrichment analyses to identify overlaying features. Additionally, the research hypothesis can be supported by additional differential expression, co-expression and methylation analyses that can be performed if respective omics data are available for cases and controls for a corresponding trait. eQTL: expression quantitative trait loci; GWAS: genome-wide association study; OMIM: online Mendelian inheritance in man; miRNA: microRNA; PPI: protein–protein interaction; TF: transcription factor; TFBS: TF binding site.
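The enrichment analysis of modules mentioned in Figure 4 is commonly an overrepresentation test. A minimal sketch using the hypergeometric distribution is given below; it asks whether a module contains more genes from an annotated set (for example, genes near GWAS hits or with a given TF binding site) than expected by chance. The gene identifiers and set sizes are hypothetical, and the enrichment tools in the table may apply different statistics and corrections.

```python
from math import comb

def enrichment_p(module, annotated, background):
    """One-sided hypergeometric p-value for overrepresentation.

    Probability of seeing at least the observed overlap between the module
    and the annotated set when the module is drawn at random from the
    background.
    """
    background = set(background)
    module = set(module) & background
    annotated = set(annotated) & background
    N, K, n = len(background), len(annotated), len(module)
    k = len(module & annotated)
    # Sum the hypergeometric probabilities for overlaps of size k or larger.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)

# Hypothetical example: 20 background genes, 5 annotated, module of 5 genes.
background = [f"g{i}" for i in range(20)]
annotated = ["g0", "g1", "g2", "g3", "g4"]
module = ["g0", "g1", "g2", "g3", "g10"]
p = enrichment_p(module, annotated, background)
print(f"p = {p:.4f}")
```

The choice of background matters: restricting it to genes present in the co-expression network avoids inflating the enrichment of genes that simply happen to be expressed in the tissue under study.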