
Reconstructing gene regulatory networks in single-cell transcriptomic data analysis.

Hao Dai^1,2, Qi-Qi Jin^1,3,4, Lin Li^1,3, Luo-Nan Chen^1,4,5,6.

Abstract

Gene regulatory networks play pivotal roles in our understanding of biological processes/mechanisms at the molecular level. Many studies have developed sample-specific or cell-type-specific gene regulatory networks from single-cell transcriptomic data based on a large amount of cell samples. Here, we review the state-of-the-art computational algorithms and describe various applications of gene regulatory networks in biological studies.

Keywords: Cell-specific network; Cell-type-specific network; Computational algorithm; Gene regulatory network; Sample-specific network; Single-cell RNA sequencing

Year: 2020 PMID: 33124218 PMCID: PMC7671911 DOI: 10.24272/j.issn.2095-8137.2020.215

Source DB: PubMed Journal: Zool Res ISSN: 2095-8137

INTRODUCTION

Single-cell RNA sequencing (scRNA-seq) technology has made it possible to measure and compare gene transcriptomic profiles at single-cell resolution (Eberwine et al., 2014; Stegle et al., 2015). Based on scRNA-seq data, new cell types with distinct functions can be identified and cellular lineages during differentiation can be traced (Rozenblatt-Rosen et al., 2017; Villani et al., 2017). Many studies have focused on developing accurate and robust computational methods for scRNA-seq data analysis (Zeng & Dai, 2019), where a key problem is how to construct gene regulatory networks (GRNs) to pinpoint crucial factors, e.g., those that control cellular differentiation and determine phenotypes in disease progression (Iacono et al., 2019). In fact, scRNA-seq technology provides a large number of cell samples, making it possible to study gene-gene associations and transcriptional networks accurately (Dai et al., 2019). At present, many studies and algorithms have been developed to construct GRNs from scRNA-seq data based on various principles and perspectives.

INFERENCE METHODS FOR GRNs

Correlation networks

Correlation network analysis is one of the most widely used methods for scRNA-seq data. These networks measure gene-gene associations based on correlation coefficients and are suitable for large and high-dimensional datasets. Langfelder & Horvath (2008) presented the popular weighted gene co-expression network analysis (WGCNA) method. This method detects modules of highly correlated genes, identifies hub genes within these modules, and measures the relationships among the modules. Since then, many improved algorithms have been presented. For example, the partial and semi-partial correlation (PPCOR) method (Kim, 2015) replaces the correlation coefficient in WGCNA with the semi-partial correlation, which measures the association between two variables after eliminating the effects of all other variables, i.e., the relationship between two genes is direct rather than influenced by other genes. The part mutual information (PMI) method (Zhang et al., 2015; Zhao et al., 2016) identifies direct associations based on the partial independence concept, and the partial information decomposition and context (PIDC) approach (Chan et al., 2017) uses partial information decomposition to determine the relationships between genes. These two methods apply concepts from information theory and can measure nonlinear relationships. Usually, correlation networks are undirected, which means that the regulatory direction between two genes is unknown. However, the lag-based expression association for pseudotime-series (LEAP) approach (Specht & Li, 2017) can construct directed gene co-expression networks from pseudotime-ordered scRNA-seq data. This method computes the Pearson correlation coefficient over all possible time lags along the estimated pseudotime (without branches), and then uses the maximum correlation to construct the network. LEAP can capture the associations hidden by time lags and provides more accurate GRNs.
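The lag-based idea described above can be illustrated with a minimal sketch. It is not the LEAP package itself: the function names, the toy expression vectors, and the `max_lag` parameter are assumptions made for illustration only. The sketch scans a range of lags, keeps the lag with the largest absolute Pearson correlation, and treats the gene that leads in pseudotime as the candidate regulator.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def max_lagged_correlation(regulator, target, max_lag):
    """Return (best_lag, best_corr) over lags 0..max_lag.

    At each lag, the regulator at pseudotime t is paired with the
    target at pseudotime t + lag.
    """
    best_lag, best_corr = 0, 0.0
    for lag in range(max_lag + 1):
        r = pearson(regulator[: len(regulator) - lag], target[lag:])
        if abs(r) > abs(best_corr):
            best_lag, best_corr = lag, r
    return best_lag, best_corr

# Hypothetical pseudotime-ordered expression: gene_b repeats gene_a two steps later.
gene_a = [0, 1, 3, 2, 5, 4, 6, 8, 7, 9, 11, 10]
gene_b = [0, 0] + gene_a[:-2]

lag_ab, corr_ab = max_lagged_correlation(gene_a, gene_b, max_lag=3)
lag_ba, corr_ba = max_lagged_correlation(gene_b, gene_a, max_lag=3)

# The stronger of the two directions suggests the direction of the edge.
direction = "a -> b" if abs(corr_ab) > abs(corr_ba) else "b -> a"
print(lag_ab, round(corr_ab, 3), direction)
```

In this toy case the association is strongest when gene_a leads gene_b by two steps, which an unlagged correlation would understate.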
Single-cell regularized inference using time-stamped expression profiles (SINCERITIES) (Gao et al., 2018) is another approach for constructing directed GRNs from time-stamped single-cell transcriptional expression profiles. This method divides the single-cell data into several time points, uses Granger causality to infer regulatory networks centered on transcription factors (TFs), and uses ridge regression and partial correlation analyses to recover the directed regulatory relationships among genes. Several methods use statistical likelihood or Bayesian networks to infer GRNs, e.g., context likelihood of relatedness (CLR) (Faith et al., 2007) and first-order autoregressive moving-average and variational Bayesian expectation-maximization (AR1MA1-VBEM) (Sanchez-Castillo et al., 2018). These methods are similar to correlation networks but are considered more accurate for measuring gene-gene relationships because they are based on nonlinear principles.
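The difference between a plain correlation and a correlation that removes the effect of other genes, which several methods in this section rely on, can be shown with a small worked example. The sketch below uses the standard first-order partial correlation formula for three variables; it is not the ppcor package, and the three expression vectors are constructed (hypothetical) so that genes x and y are both driven by gene z.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

def partial_correlation(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical expression values: z regulates both x and y,
# and x and y have no relationship beyond that shared regulator.
z = [3, 3, 3, 3, -3, -3, -3, -3]
x = [4, 2, 4, 2, -2, -4, -2, -4]   # z plus gene-specific variation
y = [4, 4, 2, 2, -2, -2, -4, -4]   # z plus independent gene-specific variation

r_xy = pearson(x, y)
r_xy_given_z = partial_correlation(x, y, z)
print(round(r_xy, 3), round(r_xy_given_z, 3))
```

The plain correlation between x and y is high, yet the partial correlation is essentially zero, so a network built on partial correlations would not draw the indirect x-y edge.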

Dynamic networks

In comparison to static (correlation) networks, dynamic networks are more suitable for describing changes in network dynamics, such as cellular lineages during differentiation. The Boolean model is one of the simplest approaches: it takes the value 0 or 1 to represent the absence or presence of gene expression and uses the Boolean operators AND, OR, and NOT to describe the interactions between genes. The Boolean model is more robust to the effects of dropout, which makes it quite useful for scRNA-seq data. Many methods have reconstructed GRNs based on synchronous or asynchronous Boolean models, such as reduced ordered binary decision diagrams (ROBDD) (Garg et al., 2008), cellular network optimizer (CellNOptR) (Terfve et al., 2012), bool trainer (BTR) (Lim et al., 2016), single-cell network synthesis toolkit (SCNS) (Woodhouse et al., 2018), and gene modular network (GMN) (Zhang et al., 2020), which have been applied to find key regulators of cell fate and reveal network rewiring during cell differentiation (Moignard et al., 2015; Xu et al., 2014). However, the Boolean model must convert expression data into binary data, which may obscure gene-gene interactions. In contrast, differential equation-based models are more complex and offer high-precision predictions (Chen et al., 2009, 2010; Wang et al., 2006), but these methods must balance time complexity and prediction accuracy. Matsumoto et al. (2017) presented a highly efficient optimization algorithm (single-cell ordinary differential equations, SCODE) to reconstruct expression dynamics and infer GRNs from differentiating cells. This method integrates the transformation of linear ordinary differential equations (ODEs) and linear regression and can reconstruct the observed expression dynamics and GRNs accurately with remarkable efficiency.
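A minimal sketch of a synchronous Boolean model may make the description above more concrete. The four genes (A to D), their update rules, and the helper names are hypothetical and chosen only for illustration; none of the cited tools is used. Every gene is recomputed from the previous state at each step, and a knockout is simulated by forcing a gene to 0.

```python
def step(state, rules, knocked_out=frozenset()):
    """One synchronous update: all genes are recomputed from the previous state."""
    new_state = {gene: bool(rule(state)) for gene, rule in rules.items()}
    for gene in knocked_out:
        new_state[gene] = False  # a knocked-out gene is never expressed
    return new_state

def run_to_fixed_point(state, rules, knocked_out=frozenset(), max_steps=50):
    """Iterate until the state stops changing; return None if it never does."""
    state = dict(state)
    for gene in knocked_out:
        state[gene] = False
    for _ in range(max_steps):
        next_state = step(state, rules, knocked_out)
        if next_state == state:
            return state
        state = next_state
    return None  # e.g. the network entered a cycle

# Hypothetical rules built from AND, OR, and NOT.
rules = {
    "A": lambda s: s["A"],                 # external input, keeps its value
    "B": lambda s: s["A"] and not s["C"],  # activated by A, repressed by C
    "C": lambda s: s["C"],                 # external input, keeps its value
    "D": lambda s: s["B"] or s["C"],       # activated by B or C
}
initial = {"A": True, "B": False, "C": False, "D": False}

wild_type = run_to_fixed_point(initial, rules)
b_knockout = run_to_fixed_point(initial, rules, knocked_out={"B"})
print(wild_type["D"], b_knockout["D"])
```

Comparing the two steady states shows how removing a node changes the predicted expression of its downstream genes.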

Tree-based networks

Huynh-Thu et al. (2010) developed a tree-based algorithm (gene network inference with ensemble of trees, GENIE3), which adopted a distinctive way to infer regulatory networks. This method decomposes the prediction of GRNs into p regression models constructed by tree-based ensemble methods, e.g., Random Forests or Extra-Trees, where p is the number of genes. In each regression model, the expression pattern of gene x (target gene) is predicted from all other genes (input genes), and the weight of interaction between the target and input genes is determined by the importance of each input gene in the regression model. Several improvements to GENIE3 have been developed. For example, the GRN inference based on gradient boosting machine (GRNBoost2) method (Moerman et al., 2019) uses gradient boosting with GENIE3 architecture to improve algorithm efficiency. Jump3 (Huynh-Thu & Sanguinetti, 2015) combines the tree-based algorithm and dynamic systems to infer GRNs by exploiting the time series of expression data. The single-cell regulatory network inference and clustering (SCENIC) method (Aibar et al., 2017) removes the indirect targets from the GENIE3 modules based on TF motif enrichment analysis, and only retains those modules with enriched TF-binding motifs, called regulons. Generally, these methods are competitive with correlation models, and are able to construct directed networks (Chen & Mar, 2018; Pratapa et al., 2020).
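The decomposition into one regression problem per target gene can be sketched as follows. This is an illustrative approximation, not the GENIE3 implementation: it assumes NumPy and scikit-learn are available, uses `RandomForestRegressor` with default settings, and the function name, tree count, and simulated three-gene dataset are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def tree_based_grn(expr, n_trees=50, seed=0):
    """expr: array of shape (cells, genes).

    Returns weights where weights[i, j] is the importance of input
    gene i when predicting target gene j (a putative edge i -> j).
    """
    n_genes = expr.shape[1]
    weights = np.zeros((n_genes, n_genes))
    for target in range(n_genes):
        # Every gene except the target serves as an input gene.
        inputs = [g for g in range(n_genes) if g != target]
        forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        forest.fit(expr[:, inputs], expr[:, target])
        weights[inputs, target] = forest.feature_importances_
    return weights

# Simulated data (hypothetical): gene 1 depends on gene 0, gene 2 is unrelated.
rng = np.random.default_rng(0)
g0 = rng.normal(size=200)
g1 = 2.0 * g0 + 0.1 * rng.normal(size=200)
g2 = rng.normal(size=200)
expr = np.column_stack([g0, g1, g2])

weights = tree_based_grn(expr)
print(np.round(weights, 2))
```

Ranking all entries of `weights` from largest to smallest gives an ordered list of candidate regulatory links.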

Deep-learning-based networks

Deep-learning frameworks have also been used to infer gene relationships. The convolutional neural network for coexpression (CNNC) approach (Yuan & Bar-Joseph, 2019) is a supervised and task-specific method, in which the network is trained on positive and negative samples, e.g., known targets of TFs, known pathways for specific biological processes, and known disease genes. Depending on the data types used for training, CNNC can predict TF targets and identify disease-related genes.

Cell-specific networks

Recently, Dai et al. (2019) presented a new cell-specific network (CSN) method that can construct a network for each single cell from scRNA-seq data by considering statistical independence. Unlike the other approaches, this method can identify gene-gene interactions and describe network heterogeneity at the single-cell level. CSN may help to find new cell types from a network perspective and reveal “dark” genes that play important roles in the network but are generally ignored by traditional differential analyses. Moreover, by considering partial independence, the conditional cell-specific network (CCSN) approach (Li et al., 2020) was developed to further reduce false positives in the CSN method.
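One way to picture a per-cell independence test is the rough sketch below. It is only loosely inspired by the idea of testing statistical independence in the neighborhood of each cell and should not be read as the published CSN algorithm: the rank-based neighborhoods, the `box_fraction` parameter, the normalization of the statistic, and the threshold of 2.33 are all assumptions made for illustration.

```python
import random
from math import sqrt

def ranks(values):
    """Rank of each value (0 = smallest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result

def cell_specific_scores(x, y, box_fraction=0.1):
    """For each cell k, score how far genes x and y deviate from independence
    among the cells whose expression is close to that of cell k."""
    n = len(x)
    half_width = max(1, int(box_fraction * n / 2))
    rank_x, rank_y = ranks(x), ranks(y)
    scores = []
    for k in range(n):
        near_x = [abs(rank_x[i] - rank_x[k]) <= half_width for i in range(n)]
        near_y = [abs(rank_y[i] - rank_y[k]) <= half_width for i in range(n)]
        n_x, n_y = sum(near_x), sum(near_y)
        n_xy = sum(a and b for a, b in zip(near_x, near_y))
        # Under independence, n_xy / n should be close to (n_x / n) * (n_y / n).
        numerator = sqrt(n - 1) * (n * n_xy - n_x * n_y)
        denominator = sqrt(n_x * n_y * (n - n_x) * (n - n_y))
        scores.append(numerator / denominator)
    return scores

random.seed(0)
gene_x = [random.random() for _ in range(200)]
gene_y = [2 * v for v in gene_x]                 # fully dependent on gene_x
gene_z = [random.random() for _ in range(200)]   # unrelated to gene_x

THRESHOLD = 2.33  # assumed cutoff, roughly a one-sided 1% level for a standard normal
dependent_edges = [s > THRESHOLD for s in cell_specific_scores(gene_x, gene_y)]
independent_edges = [s > THRESHOLD for s in cell_specific_scores(gene_x, gene_z)]
print(sum(dependent_edges), sum(independent_edges))
```

Each cell receives its own yes/no decision for the gene pair, so repeating this over all gene pairs would give one network per cell.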

APPLICATION OF GENE REGULATORY NETWORKS

Method selection

The methods listed in this paper have their own advantages and disadvantages. The choice of method primarily depends on the scientific problem to be addressed. If the study focuses on time-series-related problems, such as development, cell differentiation, or disease progression, the first choice would be an algorithm that constructs GRNs based on time-ordered data. If we only compare differences between two samples, e.g., the difference between a disease and normal state or the difference before and after medication, an algorithm based on static data should be selected. Directed networks provide information on the direction of a regulatory relationship, whereas undirected networks only measure the existence and strength of a regulatory relationship. Nonlinear algorithms can predict the strength of a regulatory relationship more accurately, but computational time will be longer; linear algorithms reduce the computational time, but accuracy also declines; binary algorithms can only show whether or not a regulatory relationship exists, but they are the fastest. Thus, if the purpose of a study is to explore the key regulatory factors controlling a biological process, research should focus on the changes or differences in network structure, instead of the strength of the regulatory relationship, and thus binary algorithms may be preferred. If we know certain regulatory factors are important and hope to identify their upstream and downstream genes, we can choose a directed network with a linear or nonlinear algorithm. If we hope to simulate a biological process through a network, for example, by deleting network nodes to simulate gene knockout processes, then nonlinear algorithms are necessary. In addition, although the latest algorithms are often better than earlier ones, it is still important to build and compare networks constructed by different principles.
Table 1 lists the type and principle of each method, and Table 2 lists the code and source of each method for reference.

Table 1

Summary of inference methods for gene regulatory networks

Method	Type of edge (linear/nonlinear/binary)	Type of edge (direction)	Input data	Principle	References
WGCNA	Linear	Undirected	Static	Pearson correlation	Langfelder & Horvath, 2008
PPCOR	Linear	Undirected	Static	Semi-partial correlation	Kim, 2015
PMI	Nonlinear	Undirected	Static	Part mutual information	Zhang et al., 2015; Zhao et al., 2016
PIDC	Nonlinear	Undirected	Static	Partial information decomposition	Chan et al., 2017
LEAP	Linear	Directed	Time-ordered	Pearson correlation	Specht & Li, 2017
SINCERITIES	Linear	Directed	Time-ordered	Ridge regression and partial correlation	Gao et al., 2018
AR1MA1-VBEM	Nonlinear	Directed	Time-ordered	Bayesian framework	Sanchez-Castillo et al., 2018
ROBDD	Binary	Directed	Time-ordered	Boolean model	Garg et al., 2008
CellNOptR	Binary	Directed	Time-ordered	Boolean model	Terfve et al., 2012
BTR	Binary	Directed	Time-ordered	Boolean model	Lim et al., 2016
SCNS	Binary	Directed	Time-ordered	Boolean model	Woodhouse et al., 2018
SCODE	Nonlinear	Directed	Time-ordered	Ordinary differential equations	Matsumoto et al., 2017
GENIE3	Nonlinear	Directed	Static	Random Forests or Extra-Trees	Huynh-Thu et al., 2010
Jump3	Nonlinear	Directed	Time-ordered	Decision trees	Huynh-Thu & Sanguinetti, 2015
SCENIC	Nonlinear	Directed	Static	GENIE3 and TF motif enrichment analysis	Aibar et al., 2017
GRNBoost2	Nonlinear	Directed	Static	GENIE3 and gradient boosting	Moerman et al., 2019
CNNC	Nonlinear	Undirected	Static	Deep learning	Yuan & Bar-Joseph, 2019
CSN	Nonlinear	Undirected	Static / Time-ordered	Statistical independence	Dai et al., 2019
CCSN	Nonlinear	Undirected	Static / Time-ordered	Statistical partial independence	Li et al., 2020
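The selection criteria discussed in this section can be applied to Table 1 mechanically. The sketch below simply encodes the method name, edge type, direction, and input data columns of Table 1 and filters on them; the `Method` record and the `select_methods` helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple

class Method(NamedTuple):
    name: str
    edge_type: str   # Linear, Nonlinear, or Binary
    direction: str   # Directed or Undirected
    input_data: str  # Static, Time-ordered, or both

# Rows transcribed from Table 1.
TABLE_1 = [
    Method("WGCNA", "Linear", "Undirected", "Static"),
    Method("PPCOR", "Linear", "Undirected", "Static"),
    Method("PMI", "Nonlinear", "Undirected", "Static"),
    Method("PIDC", "Nonlinear", "Undirected", "Static"),
    Method("LEAP", "Linear", "Directed", "Time-ordered"),
    Method("SINCERITIES", "Linear", "Directed", "Time-ordered"),
    Method("AR1MA1-VBEM", "Nonlinear", "Directed", "Time-ordered"),
    Method("ROBDD", "Binary", "Directed", "Time-ordered"),
    Method("CellNOptR", "Binary", "Directed", "Time-ordered"),
    Method("BTR", "Binary", "Directed", "Time-ordered"),
    Method("SCNS", "Binary", "Directed", "Time-ordered"),
    Method("SCODE", "Nonlinear", "Directed", "Time-ordered"),
    Method("GENIE3", "Nonlinear", "Directed", "Static"),
    Method("Jump3", "Nonlinear", "Directed", "Time-ordered"),
    Method("SCENIC", "Nonlinear", "Directed", "Static"),
    Method("GRNBoost2", "Nonlinear", "Directed", "Static"),
    Method("CNNC", "Nonlinear", "Undirected", "Static"),
    Method("CSN", "Nonlinear", "Undirected", "Static / Time-ordered"),
    Method("CCSN", "Nonlinear", "Undirected", "Static / Time-ordered"),
]

def select_methods(edge_type=None, direction=None, input_data=None):
    """Return the names of methods matching every criterion that is given."""
    selected = []
    for method in TABLE_1:
        if edge_type is not None and method.edge_type != edge_type:
            continue
        if direction is not None and method.direction != direction:
            continue
        # CSN and CCSN accept both kinds of input, so match by substring.
        if input_data is not None and input_data not in method.input_data:
            continue
        selected.append(method.name)
    return selected

binary_methods = select_methods(edge_type="Binary")
directed_static = select_methods(direction="Directed", input_data="Static")
print(binary_methods, directed_static)
```

For example, a study that compares a disease and a normal sample and needs regulatory direction would start from the directed methods that accept static data.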

Table 2

Sources of GRN inference methods

Method	Code	Source
WGCNA	R	R package: WGCNA
PPCOR	R	R package: ppcor
PMI	MATLAB	http://www.sysbio.ac.cn/cb/chenlab/software/PCA-PMI
PIDC	Julia	https://github.com/Tchanders
LEAP	R	R package: LEAP
SINCERITIES	R / MATLAB	http://www.cabsel.ethz.ch/tools/sincerities.html, https://github.com/CABSEL/SINCERITIES
AR1MA1-VBEM	MATLAB	https://github.com/mscastillo/GRNVBEM
ROBDD	Java	http://si2.epfl.ch/~garg/genysis.html
CellNOptR	R	http://www.bioconductor.org/packages/release/bioc/html/CellNOptR.html
BTR	R	R package: BTR
SCNS	R	https://github.com/swoodhouse/SCNS-GUI
SCODE	R	https://github.com/hmatsu1226/SCODE
GENIE3	R	http://www.montefiore.ulg.ac.be/~huynh-thu/software.html
Jump3	MATLAB	http://homepages.inf.ed.ac.uk/vhuynht/software.html
SCENIC	R	http://scenic.aertslab.org
GRNBoost2	Python	http://arboreto.readthedocs.io
CNNC	Python	https://github.com/xiaoyeye/CNNC
CSN	MATLAB	https://github.com/wys8c764/CSN
CCSN	MATLAB	http://sysbio.sibcb.ac.cn/cb/chenlab/soft/CCSN.zip

In this paper, we selected several widely used algorithms to test whether they can identify proven gene regulatory relationships (Liu et al., 2020; Van Dijk et al., 2018). As shown in Table 3, most methods identified all six regulatory relationships (SCENIC missed one), whereas two linear methods, WGCNA and SINCERITIES, identified only one and three, respectively. This result is not unexpected, as nonlinear algorithms usually predict gene regulatory relationships more accurately.

Table 3

Comparison of GRN inference methods

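Proven gene regulatory relationships, grouped by dataset (√: identified; ×: not identified)
Method	VIM-ZEB1 (GSE114397)	VIM-SNAI2 (GSE114397)	VIM-MYC (GSE114397)	ARID1A-ZIC1 (GSE139343)	ARID1A-SOX1 (GSE139343)	ARID1A-MAP2 (GSE139343)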
WGCNA	×	×	×	×	√	×
PPCOR	√	√	√	√	√	√
PMI	√	√	√	√	√	√
LEAP	√	√	√	√	√	√
SINCERITIES	√	×	×	×	√	√
SCODE	√	√	√	√	√	√
SCENIC	√	√	√	√	×	√
CSN	√	√	√	√	√	√
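The per-method totals behind Table 3 can be tallied directly. The snippet below transcribes the √ and × marks of Table 3 as 1 and 0 (column order as in the table) and counts the recovered relationships for each method; the variable names are illustrative.

```python
# Marks transcribed from Table 3: 1 = identified, 0 = not identified.
# Column order: VIM-ZEB1, VIM-SNAI2, VIM-MYC, ARID1A-ZIC1, ARID1A-SOX1, ARID1A-MAP2.
TABLE_3 = {
    "WGCNA":       [0, 0, 0, 0, 1, 0],
    "PPCOR":       [1, 1, 1, 1, 1, 1],
    "PMI":         [1, 1, 1, 1, 1, 1],
    "LEAP":        [1, 1, 1, 1, 1, 1],
    "SINCERITIES": [1, 0, 0, 0, 1, 1],
    "SCODE":       [1, 1, 1, 1, 1, 1],
    "SCENIC":      [1, 1, 1, 1, 0, 1],
    "CSN":         [1, 1, 1, 1, 1, 1],
}

# Number of proven relationships recovered by each method.
recovered = {method: sum(marks) for method, marks in TABLE_3.items()}

# Methods that recovered all six relationships.
complete = [method for method, count in recovered.items() if count == 6]

for method, count in recovered.items():
    print(f"{method}: {count}/6")
```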

GRN analysis in biological studies

All methods listed in this paper use scRNA-seq data as input. Most GRN analyses need some prior information: algorithms based on time-ordered data need time-series information, and algorithms based on static data need cell-type information. Both CSN and CCSN construct one network for each single cell, so they are suitable for either time-ordered or static data. Some widely used data analysis software, e.g., Seurat (Butler et al., 2018) and Monocle (Qiu et al., 2017; Trapnell et al., 2014), can help obtain cell-type or time-series information based on clustering or pseudotime analyses. No matter which algorithm is used, the subsequent network analysis is generally similar. For each network, the first step is to identify the modules it contains. Generally, a module represents a functional unit, as genes performing the same function are often closely related to each other. In each module, the number of edges connected to a node, i.e., the network degree, is an important indicator. If the network degree of a certain gene differs considerably between a disease and normal state, or changes significantly during cell differentiation, this gene may be an important regulatory factor. If regulatory factors are known, the genes related to these factors should be considered. In addition, it should be noted that the genes linking different modules are often very important.

Gene regulatory networks have been widely used in biological studies. For example, based on correlation network analysis, Pina et al. (2015) identified a key regulatory gene (Ddit3) in erythroid lineage programming and found that the Ddit3-Gata2 regulatory axis could antagonize myeloid programs and enable erythroid programs, which was validated experimentally. Xu et al. (2014) constructed Boolean networks composed of 30 genes related to the self-renewal and pluripotency of mouse embryonic stem cells (mESCs) cultured in serum/LIF or serum-free 2i/LIF conditions. They removed nodes from the Boolean network to simulate single and combinatorial RNA interference (RNAi) knockdown, and the post-RNAi expression levels predicted by network analysis showed good agreement with experimental testing. In addition, Moignard et al. (2015) used diffusion maps to identify the developmental trajectory of the mesoderm toward blood in mouse based on scRNA-seq data, and then constructed Boolean networks to recapitulate blood development. The model predicted that the Erg gene is activated by Sox17 or Hoxb4, which was validated by the observation that Sox and Hox factors control early expression of Erg. Harly et al. (2019) used LEAP to identify target genes of TCF-1 during innate lymphoid cell (ILC) development, and identified the role of TCF-1 in the developmental progress of ILC precursors. Sagar et al. (2020) established a γδ T-cell differentiation map based on fetal and adult thymus scRNA-seq data, using GENIE3 to construct GRNs and illustrate fetal and adult differences. Differentially expressed gene networks have also been successfully applied to recover and characterize distinct stages of γδ T-cell differentiation. Elyanow et al. (2020) presented a new computational method (netNMF-sc) that uses gene-gene co-expression networks as prior knowledge to perform dimensionality reduction and imputation of scRNA-seq data with high dropout rates, and which was competitive with many other methods for dimension reduction and imputation.
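The degree comparison described in this section, in which genes whose network degree differs between two conditions are flagged as candidate regulators, can be sketched in a few lines. The edge lists, gene names (G1 to G5), and function names below are hypothetical; in practice the edges would come from one of the inference methods in Table 1.

```python
from collections import Counter

def degrees(edges):
    """Count the number of edges attached to each gene in an undirected network."""
    degree = Counter()
    for gene_a, gene_b in edges:
        degree[gene_a] += 1
        degree[gene_b] += 1
    return degree

def degree_changes(edges_before, edges_after):
    """Return (gene, change in degree) pairs, largest absolute change first."""
    before, after = degrees(edges_before), degrees(edges_after)
    genes = set(before) | set(after)
    changes = {gene: after[gene] - before[gene] for gene in genes}
    # Sort by absolute change, breaking ties alphabetically for a stable order.
    return sorted(changes.items(), key=lambda item: (-abs(item[1]), item[0]))

# Hypothetical networks for a normal and a disease state.
normal_edges = [("G1", "G2"), ("G2", "G3"), ("G3", "G4")]
disease_edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G1", "G5"), ("G3", "G4")]

ranked = degree_changes(normal_edges, disease_edges)
print(ranked)
```

Here G1 gains three edges in the disease network, so it would be the first gene to examine as a possible regulatory factor.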

FUTURE PERSPECTIVES

Although many GRN methods have been developed, GRN inference remains a challenging problem in bioinformatics and computational biology. A critical issue is the low quality of single-cell sequencing data. As RNA is obtained from only one cell, noise from amplification and dropout events in sequencing is a common problem. Recently, the integration of various single-cell-omics data, such as ATAC-seq and ChIP-seq, has attracted increasing attention (Li et al., 2017; Mimitou et al., 2019; Stuart et al., 2019), which may help in the development of next-generation GRN inference algorithms for various fields, including developmental and evolutionary biology.

COMPETING INTERESTS

The authors declare that they have no competing interests.

AUTHORS’ CONTRIBUTIONS

H.D. and L.N.C. conceived the review. H.D. prepared the draft. H.D., L.L., and Q.Q.J. collected the materials. All authors contributed to the discussions. All authors read and approved the final version of the manuscript.

44 in total

1. Inferring gene regulatory networks from multiple microarray datasets.

Authors: Yong Wang; Trupti Joshi; Xiang-Sun Zhang; Dong Xu; Luonan Chen
Journal: Bioinformatics Date: 2006-07-24 Impact factor: 6.937

2. Conditional mutual inclusive information enables accurate quantification of associations in gene regulatory networks.

Authors: Xiujun Zhang; Juan Zhao; Jin-Kao Hao; Xing-Ming Zhao; Luonan Chen
Journal: Nucleic Acids Res Date: 2014-12-24 Impact factor: 16.971

3. Evaluating methods of inferring gene regulatory networks highlights their lack of performance for single cell gene expression data.

Authors: Shuonan Chen; Jessica C Mar
Journal: BMC Bioinformatics Date: 2018-06-19 Impact factor: 3.169

4. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles.

Authors: Jeremiah J Faith; Boris Hayete; Joshua T Thaden; Ilaria Mogno; Jamey Wierzbowski; Guillaume Cottarel; Simon Kasif; James J Collins; Timothy S Gardner
Journal: PLoS Biol Date: 2007-01 Impact factor: 8.029

5. Single-Cell Network Analysis Identifies DDIT3 as a Nodal Lineage Regulator in Hematopoiesis.

6. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells.

7. Construction and validation of a regulatory network for pluripotency and self-renewal of mouse embryonic stem cells.

8. Analysis on gene modular network reveals morphogen-directed development robustness in Drosophila.

9. CellNOptR: a flexible toolkit to train protein signaling networks to data using multiple logic formalisms.

10. Synchronous versus asynchronous modeling of gene regulatory networks.
