Lukas M Simon, Fangfang Yan, Zhongming Zhao.
Abstract
BACKGROUND: Single-cell RNA sequencing (scRNA-seq) unfolds complex transcriptomic datasets into detailed cellular maps. Despite recent success, there is a pressing need for specialized methods tailored towards the functional interpretation of these cellular maps.
Keywords: Autoencoder; machine learning; manifold interpretation; single-cell RNA sequencing; transcription factor
Year: 2020 PMID: 33301553 PMCID: PMC7727875 DOI: 10.1093/gigascience/giaa122
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: DrivAER correctly identifies interferon response. (A) DrivAER iteratively subjects annotated gene sets to unsupervised dimension reduction via Deep Count Autoencoder (DCA). (B) For each gene set, the 2D data manifold coordinates are calculated and (C) subsequently used as input features in a random forest model to predict the outcome of interest (i.e., pseudotemporal ordering). (D) The random forest prediction accuracy represents the relevance score. (E) t-Distributed stochastic neighbor embedding (tSNE) visualization displays all peripheral blood mononuclear cells (PBMCs) colored by cell type. NK: natural killer. (F) Cellular map (tSNE) of T cell subset clusters by stimulation status. (G) Bar plot indicates relevance scores of the 5 most and least relevant transcription programs. DCA embeddings calculated based on “INTERFERON_GAMMA_RESPONSE” (H) and “PROTEIN_SECRETION” (I) (negative control) gene sets are depicted. Cells are colored by stimulation status. (J) Heat map shows gene expression of “INTERFERON_GAMMA_RESPONSE” target genes and cells in rows and columns, respectively. Columns are ordered first by stimulation status and second by DCA coordinates. Bars on top of the heat map represent stimulation status and DCA coordinates 1 and 2. Red and blue colors correspond to high and low relative expression values. Relative expression of interferon gene IFIT2 is overlaid on top of the DCA embeddings derived from “INTERFERON_GAMMA_RESPONSE” (K) and “PROTEIN_SECRETION” (L) gene sets. Dark colors indicate higher expression.
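Panels (A) to (D) of Figure 1 describe a per-gene-set loop: reduce the gene set to two dimensions, feed the coordinates to a random forest, and take the prediction accuracy as the relevance score. The following is a minimal sketch of that loop, not the DrivAER implementation. It assumes PCA from scikit-learn as a stand-in for DCA, a binary outcome, 5-fold cross-validated accuracy as the score, and simulated Poisson counts; the function name `relevance_score` and the gene set names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def relevance_score(counts, gene_idx, labels, seed=0):
    """Embed one gene set in 2D, then score how well the embedding predicts labels."""
    # Restrict the cells-by-genes count matrix to the gene set and log-transform.
    subset = np.log1p(counts[:, gene_idx])
    # Assumption: PCA replaces the Deep Count Autoencoder used in the paper.
    embedding = PCA(n_components=2, random_state=seed).fit_transform(subset)
    # Random forest on the two coordinates; mean cross-validated accuracy is the score.
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(forest, embedding, labels, cv=5, scoring="accuracy").mean()


# Simulated data: 200 cells, 60 genes, two groups (e.g. control vs. stimulated).
rng = np.random.default_rng(0)
n_cells, n_genes = 200, 60
labels = np.repeat([0, 1], n_cells // 2)
rates = np.full((n_cells, n_genes), 2.0)
rates[labels == 1, :20] = 8.0  # genes 0-19 respond to the stimulus
counts = rng.poisson(rates)

# Hypothetical gene sets: one made of responsive genes, one of unrelated genes.
gene_sets = {
    "responsive_set": np.arange(0, 20),
    "unrelated_set": np.arange(40, 60),
}
scores = {name: relevance_score(counts, idx, labels) for name, idx in gene_sets.items()}

# Rank gene sets from most to least relevant.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```

In this toy setting the responsive gene set should score close to 1 and the unrelated set close to 0.5, the accuracy of random guessing for a balanced binary outcome.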
Figure 2: DrivAER unveils key transcription factors in blood development. PAGA (A) and cell-level graph (B) visualization of the Paul et al. [30] dataset. Cells are colored by Louvain clustering as provided by Scanpy. Two independent trajectories were calculated for erythrocyte (C) and monocyte (D) development. Cells are colored by pseudotime. (E) Bar plot displays relevance scores for the 5 most and least relevant transcription factors in the erythrocyte development trajectory. (F) DCA embedding plot was derived from the “GATA_C” gene set and is colored by pseudotime. (G) Heat map showing gene expression of cells and “GATA_C” target genes for the erythrocyte trajectory in columns and rows, respectively. (H) Bar plot displays relevance scores for the 5 most and least relevant transcription factors in the monocyte development trajectory. (I) DCA embedding plot was derived from the “PU1_Q6” gene set and is colored by pseudotime. (J) Heat map shows scaled gene expression of cells and “PU1_Q6” target genes for the monocyte trajectory in columns and rows, respectively. For both heat maps, columns are ordered by pseudotime. Bars on top of heat map indicate pseudotime, DCA coordinates 1 and 2. Red and blue colors reflect high and low expression values.
Figure 3: DrivAER identifies drivers underlying subtle transcriptional changes. (A) Two groups of single cells were simulated and gene sets were created by sampling a mixture of truly differentially expressed (DE) genes and random genes. (B) The global embedding using all genes is visualized using UMAP. (C) The DCA embedding for a gene set consisting of all truly DE genes is depicted. For both (B) and (C), cells are colored by group. (D) Relevance scores (y-axis) for gene sets ranging in the fraction of truly DE genes (x-axis) are displayed across implementations of DrivAER differing in the underlying dimension reduction methods. (E) Relevance scores (y-axis) for gene sets ranging in the fraction of truly DE genes (x-axis) are displayed using random forest (red) and support vector machine (SVM; blue) classification models. (F) Relevance scores (y-axis) for gene sets ranging in the fraction of truly DE genes (x-axis) are displayed across various configurations of the hidden layer. (G) Box plot shows significantly different relevance scores between 10 bootstrap runs of completely random gene sets (red) and gene sets consisting of 20% truly DE genes (blue) (1-sided t-test, P = 0.0467). The boxes represent the interquartile range, the horizontal line in the box is the median, and the whiskers represent 1.5 times the interquartile range. (H) PAGODA's adjusted z-scores (y-axis) are displayed for gene sets ranging in the fraction of truly DE genes (x-axis). (I) VISION's autocorrelation statistic is displayed for gene sets ranging in the fraction of truly DE genes (x-axis). (J) DrivAER (default parameters) relevance scores (y-axis) are displayed for gene sets ranging in the fraction of truly DE genes (x-axis). The horizontal dashed line indicates 0.5, the accuracy of random guesses for a binary outcome. For (D), (E), (F), (H), (I), and (J) lines represent the smoothed values and gray shading represents the 95% confidence interval derived from the smoothing fit.
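The gene set construction in Figure 3(A), a mixture of truly DE genes and random genes at a controlled fraction, can be sketched as follows. This is an illustration only: the gene names, the gene set size of 50, the pool sizes, and the helper `make_gene_set` are all hypothetical, since the caption does not state the values used in the study.

```python
import random


def make_gene_set(de_genes, background_genes, size, de_fraction, rng):
    """Sample a gene set in which `de_fraction` of the members are truly DE."""
    n_de = round(size * de_fraction)
    # Draw without replacement from each pool, then mix the two parts.
    chosen = rng.sample(de_genes, n_de) + rng.sample(background_genes, size - n_de)
    rng.shuffle(chosen)
    return chosen


rng = random.Random(0)
de_genes = [f"de_{i}" for i in range(100)]          # hypothetical truly DE genes
background_genes = [f"bg_{i}" for i in range(900)]  # hypothetical non-DE genes

# One gene set per DE fraction, from fully random (0.0) to fully DE (1.0).
fractions = [i / 10 for i in range(11)]
gene_sets = {f: make_gene_set(de_genes, background_genes, 50, f, rng) for f in fractions}

for f, genes in gene_sets.items():
    n_de = sum(g.startswith("de_") for g in genes)
    print(f"fraction={f:.1f}: {n_de} DE genes out of {len(genes)}")
```

Scoring each of these gene sets and plotting the score against the DE fraction gives a curve of the kind shown in panels (D) to (J).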