Claudia Skok Gibbs, Christopher A Jackson, Giuseppe-Antonio Saldi, Andreas Tjärnberg, Aashna Shah, Aaron Watters, Nicholas De Veaux, Konstantine Tchourine, Ren Yi, Tymor Hamamsy, Dayanne M Castro, Nicholas Carriero, Bram L Gorissen, David Gresham, Emily R Miraldi, Richard Bonneau.
Abstract
MOTIVATION: Gene regulatory networks define regulatory relationships between transcription factors and target genes within a biological system, and reconstructing them is essential for understanding cellular growth and function. Methods for inferring and reconstructing networks from genomics data have evolved rapidly over the last decade in response to advances in sequencing technology and machine learning. The scale of data collection has increased dramatically; the largest genome-wide gene expression datasets have grown from thousands of measurements to millions of single cells, and new technologies are on the horizon to increase to tens of millions of cells and above.
Year: 2022 PMID: 35188184 PMCID: PMC9048651 DOI: 10.1093/bioinformatics/btac117
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Learning GRNs with the Inferelator. (A) The response to the sugar galactose in S. cerevisiae is mediated by the Gal4 and Gal80 TFs, a prototypical mechanism for altering cellular gene expression in response to stimuli. (B) Gal4 and Gal80 regulation represented as an unsigned directed graph connecting regulatory TFs to target genes. (C) Genome-wide GRNs are inferred from gene expression data and prior knowledge about network connections using the Inferelator, and the resulting networks are scored by comparison with a gold standard of known interactions. A subset of genes is held out of the prior knowledge and used for evaluating performance.
Fig. 2. Network inference performance on multiple model organism datasets. (A) Schematic of Inferelator workflow and a brief summary of the differences between GRN model selection methods. (B) Results from 10 replicates of GRN inference for each modeling method on (i) B. subtilis GSE67023 (B1) and GSE27219 (B2) and (ii) S. cerevisiae GSE142864 (S1) and Tchourine (S2). Precision–recall curves are shown for replicates where 20% of genes are held out of the prior and used for evaluation, with a smoothed consensus curve. The black dashed line on the precision–recall curve is the expected random performance based on random sampling from the gold standard. AUPR is plotted for each cross-validation result in gray, with mean ± standard deviation in color. Experiments labeled with (S) are shuffled controls, where the labels on the prior adjacency matrix have been randomly shuffled. A total of 10 shuffled replicates are shown as gray dots, with mean ± standard deviation in black. The blue dashed line is the performance of the GRNBOOST2 network inference algorithm, which does not use prior network information, scored against the entire gold standard network. (C) Results from 10 replicates of GRN inference using two datasets as two network inference tasks on (i) B. subtilis and (ii) S. cerevisiae. AMuSR is a multi-task-learning method; BBSR and StARS-LASSO are run on each task separately and then combined into a unified GRN. AUPR is plotted as in (B).
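The scoring scheme in this caption (hold out a fraction of genes from the prior, then score ranked edges against a gold standard by AUPR) can be sketched roughly as below. This is a minimal illustration and not the Inferelator's own implementation: the function names `split_holdout` and `aupr` are hypothetical, and AUPR is computed here as average precision, which is only one of several ways to integrate a precision–recall curve.

```python
import numpy as np

def split_holdout(genes, fraction=0.2, seed=0):
    """Randomly hold out a fraction of genes (hypothetical helper).

    Returns (kept_genes, held_out_genes); the held-out genes would be
    removed from the prior and used only for evaluation.
    """
    rng = np.random.default_rng(seed)
    genes = np.asarray(genes)
    n_hold = int(round(fraction * genes.size))
    held = rng.choice(genes, size=n_hold, replace=False)
    kept = np.setdiff1d(genes, held)
    return kept, held

def aupr(confidence, gold):
    """Average-precision estimate of the area under the PR curve.

    confidence: score for each candidate TF -> gene edge (higher = more confident)
    gold: boolean flag per edge, True if the edge is in the gold standard
    Assumes at least one gold-standard edge is present.
    """
    confidence = np.asarray(confidence, dtype=float)
    gold = np.asarray(gold, dtype=bool)
    order = np.argsort(-confidence, kind="stable")  # most confident first
    hits = gold[order]
    true_pos = np.cumsum(hits)
    precision = true_pos / np.arange(1, hits.size + 1)
    # Mean of the precision values at the ranks where a true edge is recovered
    return float(precision[hits].sum() / hits.sum())

kept, held = split_holdout([f"gene{i}" for i in range(10)])
score = aupr([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
```

Under random ranking, precision is expected to sit near the fraction of candidate edges that are in the gold standard, which is the role the dashed baseline plays in the precision–recall panels.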
Fig. 3. Construction and performance of network connectivity priors using TF motif scanning. (A) Schematic of inferelator-prior workflow, scanning identified regulatory regions (e.g. by ATAC) for TF motifs to construct adjacency matrices. (B) Jaccard similarity index between S. cerevisiae prior adjacency matrices generated by the inferelator-prior package, by the CellOracle package, and obtained from the YEASTRACT database. Prior matrices were generated using TF motifs from the CIS-BP, JASPAR and TRANSFAC databases with each pipeline (n is the number of edges in each prior adjacency matrix). (C) The performance of Inferelator network inference using each motif-derived prior. Performance is evaluated by AUPR, scoring against genes held out of the prior adjacency matrix, based on inference using 2577 genome-wide microarray experiments. Experiments labeled with (S) are shuffled controls, where the labels on the prior adjacency matrix have been randomly shuffled. The black dashed line is the performance of the GRNBOOST2 algorithm, which does not incorporate prior knowledge, scored against the entire gold standard network.
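The Jaccard similarity index used in panel (B) is the number of edges shared by two priors divided by the number of edges present in either one. A minimal sketch follows, assuming both adjacency matrices are already aligned to the same gene rows and TF columns; the function name `jaccard_index` and the toy matrices are illustrative and not taken from any of the packages named in the caption.

```python
import numpy as np

def jaccard_index(prior_a, prior_b):
    """Jaccard similarity between two prior adjacency matrices.

    Any nonzero entry counts as an edge. Both matrices are assumed to
    share the same (genes x TFs) layout.
    """
    a = np.asarray(prior_a) != 0
    b = np.asarray(prior_b) != 0
    if a.shape != b.shape:
        raise ValueError("prior matrices must have the same shape")
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention chosen here for two empty priors
    return float(np.logical_and(a, b).sum() / union)

# Toy 2-gene x 2-TF priors: 2 shared edges out of 4 edges in the union
prior_a = [[1, 0], [1, 1]]
prior_b = [[1, 1], [0, 1]]
similarity = jaccard_index(prior_a, prior_b)
```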
Fig. 4. Network inference performance using S. cerevisiae single-cell data. (A) Uniform Manifold Approximation and Projection plot of yeast scRNAseq data, colored by the experimental grouping of individual cells (tasks). (B) The effect of preprocessing methods on network inference using BBSR model selection on 14 task-specific expression datasets, as measured by AUPR. Colored dots represent mean ± standard deviation of all replicates. Data are either untransformed (raw counts), transformed by Freeman–Tukey Transform (FTT), or transformed by pseudocount. Non-normalized data are compared to data normalized so that all cells have identical count depth. Network inference performance is compared to two baseline controls: data that have been replaced by Gaussian noise (N), and network inference using shuffled labels in the prior network (S). (C) Performance evaluated as in (B) on StARS-LASSO model selection. (D) Performance evaluated as in (B) on AMuSR model selection. (E) Precision–recall of a network constructed using FTT-transformed, non-normalized AMuSR model selection, as determined by the recovery of the prior network. Dashed red line is the retention threshold identified by MCC. (F) MCC of the same network as in (E). Dashed red line is the confidence score of the maximum MCC. (G) Performance evaluated as in (B) comparing the Inferelator (FTT-transformed, non-normalized AMuSR) against the SCENIC and CellOracle network inference pipelines. (H) Performance of the Inferelator (FTT-transformed, non-normalized AMuSR) compared to SCENIC and CellOracle without holding genes out of the prior knowledge network. Additional edges are added randomly to the prior knowledge network as a percentage of the true edges in the prior. Colored dashed lines represent controls for each method where the labels on the prior knowledge network are randomly shuffled. The black dashed line represents performance of the GRNBOOST2 algorithm, which identifies gene adjacencies as the first part of the SCENIC pipeline without using prior knowledge.
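The preprocessing options compared in panels (B) to (D) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used for the figure: the Freeman–Tukey transform is written in its standard form, sqrt(x) + sqrt(x + 1), and any constant offset the authors may apply is not given here; the pseudocount transform is assumed to be log2(x + 1); and depth normalization is assumed to rescale every cell to the median total count. All function names are hypothetical.

```python
import numpy as np

def freeman_tukey(counts):
    """Freeman-Tukey variance-stabilizing transform, standard form."""
    counts = np.asarray(counts, dtype=float)
    return np.sqrt(counts) + np.sqrt(counts + 1.0)

def log_pseudocount(counts, pseudocount=1.0):
    """Log transform after adding a pseudocount (log2(x + 1) assumed)."""
    return np.log2(np.asarray(counts, dtype=float) + pseudocount)

def normalize_depth(counts, target=None):
    """Rescale each cell (row) so that all cells have the same total count.

    Assumes every cell has at least one count; target defaults to the
    median depth across cells (an assumption made for this sketch).
    """
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)
    if target is None:
        target = np.median(depth)
    return counts * (target / depth)

# Toy cells x genes count matrix with depths 4 and 8
raw = np.array([[1, 3], [2, 6]])
normalized = normalize_depth(raw)      # both rows now sum to 6
ftt = freeman_tukey(raw)               # FTT on non-normalized counts
logged = log_pseudocount(normalized)   # pseudocount on normalized counts
```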
Fig. 5. Processing large single-cell mouse brain data for network inference. (A) Uniform Manifold Approximation and Projection plot of all mouse brain scRNAseq data with excitatory neurons, interneurons, glial cells and vascular cells colored. (B) Uniform Manifold Approximation and Projection plot of cells from each broad category colored by Louvain clusters and labeled by cell type. (C) Heatmap of normalized gene expression for marker genes that distinguish cluster cell types within broad categories. (D) Uniform Manifold Approximation and Projection plot of mouse brain scATAC data with excitatory neurons, interneurons and glial cells colored. (E) Heatmap of normalized mean gene accessibility for marker genes that distinguish broad categories of cells. (F) The number of scRNAseq and scATAC cells in each of the broad categories. (G) The number of scRNAseq cells in each cell-type-specific cluster.
Fig. 6. Learned GRN for the mouse brain. (A) MCC for the aggregate network based on Inferelator prediction confidence. The dashed line shows the confidence score that maximizes MCC. Network edges at and above this line are retained in the final network. (B) Aggregate GRN learned. (C) Network edges that are present in every individual task. (D) Jaccard similarity index between each task network. (E) Network targets of the EGR1 TF in neurons. (F) Network targets of the EGR1 TF in both neurons and glial cells. (G) Network targets of the EGR1 TF in glial cells. (H) Network of the ATF4 TF where blue edges are neuron specific, orange edges are glial specific and black edges are present in both categories.
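The thresholding described in panel (A) can be illustrated with a short sketch: rank edges by confidence, compute the Matthews correlation coefficient (MCC) for each possible cutoff against a reference set of known edges, and keep the edges at or above the cutoff with the highest MCC. The function name `mcc_threshold` is hypothetical, confidence scores are assumed to be distinct, and this is not presented as the Inferelator's own code.

```python
import numpy as np

def mcc_threshold(confidence, gold):
    """Return (confidence cutoff, MCC) for the cutoff that maximizes MCC.

    confidence: score per candidate edge (assumed distinct)
    gold: boolean flag per edge, True if the edge is a known interaction
    """
    confidence = np.asarray(confidence, dtype=float)
    gold = np.asarray(gold, dtype=bool)
    order = np.argsort(-confidence, kind="stable")  # most confident first
    conf_sorted = confidence[order]
    hits = gold[order]
    n = hits.size
    # Confusion-matrix counts if the top k edges are retained, for k = 1..n
    tp = np.cumsum(hits).astype(float)
    fp = np.arange(1, n + 1) - tp
    fn = hits.sum() - tp
    tn = n - tp - fp - fn
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = np.zeros(n)
    # MCC is left at 0 where the denominator is 0 (a common convention)
    np.divide(tp * tn - fp * fn, denom, out=mcc, where=denom > 0)
    best = int(np.argmax(mcc))
    return float(conf_sorted[best]), float(mcc[best])

scores = np.array([0.9, 0.8, 0.7, 0.2, 0.1])
known = np.array([True, True, False, False, False])
cutoff, best_mcc = mcc_threshold(scores, known)
retained = scores >= cutoff  # edges at and above the cutoff are kept
```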