Literature DB >> 32931579

COGENT: evaluating the consistency of gene co-expression networks.

Lyuba V Bozhilova¹, Javier Pardo-Diaz¹, Gesine Reinert¹, Charlotte M Deane¹.

Abstract

SUMMARY: Gene co-expression networks can be constructed in multiple different ways, both in the use of different measures of co-expression, and in the thresholds applied to the calculated co-expression values, from any given dataset. It is often not clear which co-expression network construction method should be preferred. COGENT provides a set of tools designed to aid the choice of network construction method without the need for any external validation data.
AVAILABILITY AND IMPLEMENTATION: https://github.com/lbozhilova/COGENT. SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.

Entities: Gene Mutation Species

Year: 2021 PMID： 32931579 PMCID： PMC8317122 DOI： 10.1093/bioinformatics/btaa787

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Gene expression data are a powerful resource for understanding genetic function under different conditions. A common way of exploring this data is through gene co-expression networks (Lee ). In these networks, genes are represented by nodes and highly co-expressed gene pairs are connected by edges. Such networks have been used in many ways, including for gene function prediction and the identification of disease- or tissue-relevant gene modules (van Dam ). Gene expression data typically take the form of a matrix, in which rows correspond to genes and columns correspond to samples. Network construction commonly consists of three steps—the data are pre-processed, a measure of co-expression is calculated for every pair of genes and a score cut-off is applied. Different approaches to data pre-processing and normalization exist (Abbas-Aghababazadeh ; Park ). Further, after normalization co-expression can be calculated in a number of different ways—e.g. via a correlation coefficient or mutual information. A score cut-off is then usually imposed in order to identify gene pairs which are highly co-expressed. Alternatively, weighted networks can be analysed, in which edge weights correspond to levels of co-expression (Langfelder and Horvath, 2008). There are many available methods for network construction which can be applied to the same dataset, and which then lead to different networks. Enrichment analysis or comparison to orthogonal data, such as protein interaction data, is commonly used for network selection and validation. However, for many species and datasets poor or non-existent functional annotation makes this type of validation difficult. It is therefore often not clear which of the available network construction methods should be prioritized (De Smet and Marchal, 2010). Here, we introduce COGENT (COnsistency of Gene Expression NeTworks), an R package designed to aid the choice of a network construction pipeline without the need for annotation or external data. COGENT can be used to choose between competing co-expression measures, as well as to inform score cut-off choice. While designed for gene expression data, COGENT can be applied to other cases where network construction relies on similarity profiling, e.g. microbiome or synthetic lethality data.

2 Software description

2.1 Method overview

Consistent co-occurrence of two gene products in the cell points towards a functional relationship between the genes. Gene product abundance, as well as co-occurrence, is a continuous-time phenomenon, which is experimentally observed at discrete time points or samples. The construction of a gene co-expression network can therefore be thought of as an estimation problem—we aim to infer general co-expression patterns from a limited set of data points. One way of investigating the success of such a procedure is through resampling. Networks constructed from a subset of all available samples will be noisier than the network constructed from the full dataset. However, they should still resemble each other: if subsetting the data results in networks with little to no overlap, then the network construction procedure may be too sensitive to noise in the data. COGENT evaluates network construction methods through iterative resampling. At each step, the gene expression samples are split into two possibly overlapping sets of equal size. The same network construction function is applied to both sets in order to obtain two gene co-expression networks and . COGENT then calculates several measures of consistency between these two networks. An example workflow for calculating consistency is shown in Figure 1; see Supplementary Section S2 for workflow details and Supplementary Section S4 for different consistency measures.

Fig. 1.

COGENT workflow schematic. In this example, the input data are a gene expression matrix with rows corresponding to genes—in this case —and columns corresponding to samples (far left). First, the expression matrix columns are randomly split into two possibly overlapping groups of equal size (left). Then, a network is constructed from each of the sample groups (right). Finally, the two resulting networks are compared and the consistency between them is calculated (far right). In this example, the two networks have six edges each, and overlap at four of these edges. One measure of their consistency is the Jaccard index between their edge sets (see Supplementary Section S4), which in this case is 0.50 The entire procedure is repeated multiple times in order to obtain robust results. Network construction methods which result in highly similar pairs of networks are considered to be consistent. When two or more competing methods are considered, the method exhibiting higher internal consistency should be preferred.

2.2 Edge set consistency

Consistency in COGENT is measured through a network comparison step at each iteration. The two networks G1 and G2 are considered to be similar if their edge sets are similar (). We measure agreement between E1 and E2 using a (weighted) Jaccard index to produce two measures of edge set consistency—global and local similarity (Kao and Porter, 2018; see Supplementary Section S4.1). Since global similarity scales with network density, a density adjustment is required when methods resulting in different network densities are compared. This is particularly important when choosing a score cut-off. Density adjustment in COGENT is carried out through comparison to random networks generated using a configuration model (see Supplementary Section S4.2).

2.3 Node metric consistency

Density-independent network comparison can also be performed by calculating a node metric such as degree or betweenness for all nodes in each of the two networks, and then comparing the obtained metric values. If the aim of downstream network analysis is e.g. to identify genes with many co-expression partners, the degree is a natural node metric to use. At each COGENT iteration, a node metric set by the user can be applied to G1 and G2, resulting in two node metric vectors d1 and d2, respectively. These two can be compared in three different ways: via a correlation coefficient, rank k-similarity (Bozhilova ; Trajanovski ) and Euclidean distance; see Supplementary Section S4.3 for details.

3 Application

Global and local similarity and adjusted edge consistency, as well as node metric comparisons can all be used to evaluate network consistency. By iteratively resampling the data and measuring network consistency, COGENT can be used to prioritize different network construction pipelines, as well as to inform co-expression cut-offs without the need for external validation data. A full worked example can be found in Supplementary Section S5 and in the COGENT tutorial. COGENT has also been used to assess signed distance correlation as a measure of gene co-expression (Pardo-Diaz ). This application further shows that network construction methods prioritized by COGENT also capture more protein–protein interaction data than methods which were not prioritized. While originally developed for gene expression data, COGENT can also be used for the inspection of other data types. For example, it can be applied to microbiome data in order to identify symbiotic organisms, or to synthetic lethality data in order to identify genetic interactions with high confidence.

Funding

This work was supported by the EPSRC [EP/L016044/1 to L.V.B. and C.M.D., EP/R512333/1 to J.P.D., G.R. and C.M.D.], the BBSRC [BB/T001801/1 to G.R.], COST [CA15109 to G.R.] and e-Therapeutics plc. Conflict of Interest: none declared. Click here for additional data file.

8 in total

1. Coexpression analysis of human genes across many microarray data sets.

Authors: Homin K Lee; Amy K Hsu; Jon Sajdak; Jie Qin; Paul Pavlidis
Journal: Genome Res Date: 2004-06 Impact factor: 9.043

Review 2. Advantages and limitations of current network inference methods.

Authors: Riet De Smet; Kathleen Marchal
Journal: Nat Rev Microbiol Date: 2010-08-31 Impact factor: 60.633

3. Gene co-expression analysis for functional classification and gene-disease predictions.

Authors: Sipko van Dam; Urmo Võsa; Adriaan van der Graaf; Lude Franke; João Pedro de Magalhães
Journal: Brief Bioinform Date: 2018-07-20 Impact factor: 11.622

4. Measuring rank robustness in scored protein interaction networks.

Authors: Lyuba V Bozhilova; Alan V Whitmore; Jonny Wray; Gesine Reinert; Charlotte M Deane
Journal: BMC Bioinformatics Date: 2019-08-28 Impact factor: 3.169

5. Robust gene coexpression networks using signed distance correlation.

Authors: Javier Pardo-Diaz; Lyuba V Bozhilova; Mariano Beguerisse-Díaz; Philip S Poole; Charlotte M Deane; Gesine Reinert
Journal: Bioinformatics Date: 2021-02-01 Impact factor: 6.931

6. WGCNA: an R package for weighted correlation network analysis.

Authors: Peter Langfelder; Steve Horvath
Journal: BMC Bioinformatics Date: 2008-12-29 Impact factor: 3.169

7. Evaluation of normalization methods for microarray data.

Authors: Taesung Park; Sung-Gon Yi; Sung-Hyun Kang; SeungYeoun Lee; Yong-Sung Lee; Richard Simon
Journal: BMC Bioinformatics Date: 2003-09-02 Impact factor: 3.169

8. Comparison of normalization approaches for gene expression studies completed with high-throughput sequencing.

Authors: Farnoosh Abbas-Aghababazadeh; Qian Li; Brooke L Fridley
Journal: PLoS One Date: 2018-10-31 Impact factor: 3.240