H Atakan Ekiz, Christopher J Conley, W Zac Stephens, Ryan M O'Connell.
Abstract
BACKGROUND: Single cell RNA sequencing (scRNAseq) has provided invaluable insights into cellular heterogeneity and functional states in health and disease. During the analysis of scRNAseq data, annotating the biological identity of cell clusters is an important step before downstream analyses, and it remains technically challenging. The current solutions for annotating single cell clusters generally lack a graphical user interface, can be computationally intensive, or have a limited scope. On the other hand, manually annotating single cell clusters by examining the expression of marker genes can be subjective and labor-intensive. To improve the quality and efficiency of annotating cell clusters in scRNAseq data, we present a web-based R/Shiny app and R package, Cluster Identity PRedictor (CIPR), which provides a graphical user interface to quickly score gene expression profiles of unknown cell clusters against mouse or human references, or a custom dataset provided by the user. CIPR can be easily integrated into the current pipelines to facilitate scRNAseq data analysis.
Keywords: Cluster analysis; Gene expression profiling; Identity prediction; Immune cells; Similarity; Single cell RNA-sequencing
Year: 2020 PMID: 32414321 PMCID: PMC7227235 DOI: 10.1186/s12859-020-3538-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Summary of the reference datasets included in CIPR
| Reference dataset | Species | Number of samples/features | Number of cell types (main/fine) | Reference cell types | Ref |
|---|---|---|---|---|---|
| Immunological Genome Project (ImmGen) | Mouse | 296/24197 | 20/296 | B cell, Baso, DC, Eosino, gd-T, Gran, ILC-1, ILC-2, ILC-3, Mac, Mast, Mono, NK, NKT, Pre-B, Pre-T, Stem-Prog, Stromal, T cell, Treg | [ |
| Presorted cell RNAseq (various tissues) | Mouse | 358/21214 | 18/28 | Adipocyte, Astrocyte, B cell, Cardiomyocyte, DC, Endothelial, Epithelial, Erythrocyte, Fibroblast, Gran, Hepatocyte, Mac, Microglia, Mono, Neuron, NK, Oligodendrocyte, T cell | [ |
| Blueprint/ENCODE | Human | 259/19859 | 24/43 | Adipocytes, B cell, T cell, Chondrocyte, DC, Endothelial, Eosino, Epithelial, Erythrocyte, Fibroblast, HSC, Keratinocyte, Mac, Melanocyte, Mesangial, Mono, Myocyte, Neuron, Neutro, NK cells, Pericyte, Skeletal muscle, Smooth muscle | [ |
| Human Primary Cell Atlas | Human | 713/19363 | 37/157 | Astrocyte, B cell, BM, Prog, Chondrocyte, CMP, DC, ESC, Endothelial, Epithelial, Erythroblast, Fibroblast, Gametocyte, GMP, Hepatocyte, HSC, iPS, Keratinocyte, Mac, MEP, Mono, MSC, Myelocyte, Neuroepithelial, Neuron, Neutro, NK, Osteoblast, Platelet, Pre/Pro-B, Smooth muscle, T cell, Tissue SC | [ |
| Database of Immune Cell Expression (DICE) | Human | 15*/57,773 | 5/15 | CD4+ T cell, CD8+ T cell, NK cell, B cell, Mono | [ |
| Hematopoietic differentiation | Human | 211/13276 | 17/38 | B cell, Baso, CD4+ T cell, CD8+ T cell, CMPs, DC, Eosino, Erythroid, GMP, Gran, HSC, Megakaryocyte, MEP, Mono, NK, NKT | [ |
| Presorted cell RNAseq (PBMC) | Human | 114/46077 | 11/29 | B cells, Baso, CD4+ T cell, CD8+ T cell, DC, Mono, Neutro, NK cells, Prog, T cell | [ |
Fig. 1 CIPR provides an R/Shiny-powered graphical user interface to facilitate cluster annotation in scRNAseq experiments. a T-distributed stochastic neighbor embedding (t-SNE) plot for the example scRNAseq data derived from murine melanoma tumor-infiltrating lymphocytes shows 15 distinct immune cell clusters within the tumor microenvironment (the dataset contains 13,985 features and 11,054 cells) [28]. To demonstrate the capabilities of CIPR, we focus on clusters 05 and 15, which distinctly expressed (b) natural killer (NK) cell and (c) plasmacytoid dendritic cell (pDC) markers, respectively. d We used the CIPR pipeline to score the gene expression profiles of cluster 15 (pDC) against the 296 mouse immune cell samples found in the ImmGen reference. The CIPR algorithm calculates a distinct identity score for each reference cell type and generates a graphical summary of the results. In these plots, the 4 highest data points (red rectangle) correspond to pDC samples within the ImmGen reference. The shaded regions in the graphs delineate 1 and 2 standard deviations around the mean identity score calculated from the entire reference data frame. Data points are color-coded by reference cell type, allowing an easy assessment of the results. e The CIPR results for cluster 05 (NK cells) are shown. Marked data points depict the NK cells in the ImmGen dataset that had the highest identity scores. Users can visualize graphs for each cluster separately and have the option of further manipulating the plots if the R package implementation of CIPR is used. f CIPR can also generate graphical outputs to summarize the 5 top-scoring reference samples for each experimental cluster. The scatter plot shows the pDC and NK cell subsets that had the highest scores for clusters 05 and 15. In the Shiny implementation of CIPR, users can draw rectangles around these points to prompt a table output that provides further information about the reference cell types on the graph.
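The shaded bands and the notion of a top-scoring reference sample in this caption can be illustrated with a small sketch. It assumes identity scores are already available as a plain mapping from reference sample name to score; the sample names, the score values, and the choice of population standard deviation are illustrative assumptions, not taken from the CIPR source code.

```python
from statistics import mean, pstdev

def z_scores(identity_scores):
    """Return (score - mean) / SD for each reference sample.

    Uses the population SD; whether CIPR uses the population or the
    sample SD is an assumption of this sketch.
    """
    values = list(identity_scores.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return {name: 0.0 for name in identity_scores}
    return {name: (score - mu) / sd for name, score in identity_scores.items()}

def sd_band(identity_scores, k):
    """Return the (lower, upper) bounds of the band k SDs around the mean."""
    values = list(identity_scores.values())
    mu, sd = mean(values), pstdev(values)
    return mu - k * sd, mu + k * sd

# Hypothetical identity scores for one cluster against four reference samples.
scores = {"ref_pDC": 9.0, "ref_NK": 1.0, "ref_B": 1.0, "ref_T": 1.0}
z = z_scores(scores)
lower_1sd, upper_1sd = sd_band(scores, 1)
best = max(z, key=z.get)
```

In this toy input the pDC-like reference lies above the 1 SD band while the other three fall inside it, which is the visual pattern the caption describes for the highlighted points.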
Fig. 2 Different analytical methods implemented in CIPR perform comparably in annotating single cell clusters. Three of the analytical methods in CIPR (logFC dot product, logFC Spearman’s correlation, and logFC Pearson’s correlation) utilize only the genes that are differentially expressed in clusters. The recommended approach in CIPR is the logFC dot product method, since it takes both the direction and the amount of differential expression into account when calculating identity scores per cluster. The other approaches in CIPR are designed to analyze the expression profiles of all the genes in the experimental data regardless of their differential expression status. This figure compares the predictions of the logFC dot product method to the other analytical approaches in CIPR. Data points in the scatter plots indicate the identity scores of individual ImmGen reference cell subsets calculated for clusters 05 and 15 by different methods. As expected, there is a strong correlation between the results of the logFC dot product method and the (a) logFC Spearman’s and (b) logFC Pearson’s correlation methods for both clusters. c, d The same strong correlation was observed when the z-scores were compared for these methods, although logFC dot product differentiated the highest-scoring reference subsets slightly better, as evidenced by a higher z-score. The results of the (e) all-genes Spearman’s and (f) all-genes Pearson’s methods show an overall positive correlation with those from the logFC dot product method, although the logFC dot product approach was able to better differentiate the top-scoring reference subsets, as evidenced by the higher z-scores shown in panels g and h. Similar observations were made for other clusters in the experimental dataset but are not shown due to space constraints.
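The difference between a dot product and a correlation on log fold-change (logFC) vectors can be shown in a few lines. The sketch below assumes that both the cluster and the reference sample are already summarized as gene-to-logFC mappings; how CIPR derives the reference logFC values is not shown here, and the gene names and numbers are hypothetical.

```python
import math

def dot_product_score(cluster_logfc, reference_logfc):
    """Sum of logFC products over genes present in both mappings."""
    shared = cluster_logfc.keys() & reference_logfc.keys()
    return sum(cluster_logfc[g] * reference_logfc[g] for g in shared)

def aligned(cluster_logfc, reference_logfc):
    """Return two value lists ordered by the shared gene names."""
    genes = sorted(cluster_logfc.keys() & reference_logfc.keys())
    return [cluster_logfc[g] for g in genes], [reference_logfc[g] for g in genes]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical logFC values for three differentially expressed genes.
cluster = {"gene_a": 3.0, "gene_b": 2.0, "gene_c": -1.0}
ref_similar = {"gene_a": 2.5, "gene_b": 1.5, "gene_c": -2.0}
ref_opposite = {"gene_a": -1.0, "gene_b": -0.5, "gene_c": 3.0}
ref_doubled = {g: 2 * v for g, v in ref_similar.items()}

score_similar = dot_product_score(cluster, ref_similar)
score_opposite = dot_product_score(cluster, ref_opposite)
score_doubled = dot_product_score(cluster, ref_doubled)
r_similar = pearson(*aligned(cluster, ref_similar))
r_doubled = pearson(*aligned(cluster, ref_doubled))
rho_similar = spearman(*aligned(cluster, ref_similar))
```

Doubling the reference logFC values doubles the dot product but leaves the Pearson coefficient unchanged, which is one way to read the caption's statement that the dot product accounts for the amount of differential expression as well as its direction.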
Fig. 3 CIPR performs faster than other cluster analysis approaches and produces comparable results. SingleR and scmap are recently described R packages for automated cluster analysis which can perform analyses at the cluster level, similarly to the CIPR approach. These algorithms were shown to perform well in various experimental contexts and can serve as a high benchmark for automated cluster analysis solutions. Here we report a comparison of the CIPR R package (v.0.1.0), SingleR (v1.0.5), and scmap (v1.8.0) in terms of predictions and performance, with all analyses performed at the cluster level. For these comparisons, a Surface Pro4 computer equipped with 64-bit Win7, 16 GB memory, 2.2GHz i7-6650U CPU, R (v.3.6.2), and RStudio (v.1.2.5033) was used with no other background processes. a Five analytical methods implemented in CIPR were compared to SingleR and scmap across 5 individual clusters. Data points indicate the identity scores calculated for each ImmGen reference cell subset by different methods. The color gradient specifies the identity score calculated by the scmap method (gray indicates that no significant mappings were found). As expected, CIPR’s all-genes Spearman’s/Pearson’s methods are highly concordant with the SingleR pipeline. The results from the CIPR logFC methods show an overall positive correlation with SingleR, where the highest-scoring reference cell types in CIPR were similar to those calculated by SingleR and scmap. In some cases, scmap failed to find a significant association, which may be due to its suboptimal power when bulk reference data are used as input. b CIPR performs significantly faster than SingleR, and comparably to scmap, in 5 separate tests. We benchmarked the runtime of the SingleR function both with and without the fine-tuning feature. Scmap (short) measures the runtime of the scmapcluster computational engine, whereas scmap (long) measures the runtime starting with the initial object creation. c CIPR utilizes less computer memory over time compared to (d) SingleR (no fine tuning) and (e) scmap.
Fig. 4 CIPR allows users to limit the analysis to highly variable reference genes to improve cluster annotations. As genes with variable expression profiles contain more information to discriminate cell types, we implemented a variance filtering parameter in CIPR. The user-defined variance threshold parameter instructs the algorithm to utilize the genes with variances above a certain quantile across the reference dataset, thus limiting the analysis to highly variable genes. Plots compare the CIPR results with or without variance thresholding when the all-genes Spearman’s method is used. Identity scores and z-scores were calculated for clusters 05 (NK cells) and 15 (pDCs) using the ImmGen reference, and results for individual reference sample types are plotted as color-coded data points. Applying variance thresholding and increasing its stringency from the top 10% to the top 1% reduced the identity scores of low/intermediate-scoring reference cell subsets, while the highest-scoring reference cell subsets remained unaffected, as evidenced by data points overlapping with the y = x line for (a) cluster 05 and (b) cluster 15. Similar trends were observed for other clusters in the analysis (not shown). The differential impact on identity scores of high- and low-scoring reference cell subsets led to an increased z-score for the highest-scoring reference subsets for both (c) cluster 05 and (d) cluster 15. These findings suggest that variance thresholding can improve the discrimination of some reference cell subsets. Although the best thresholding value remains to be determined in individual studies, the CIPR pipeline allows a level of flexibility to be adapted to different experimental contexts.
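The variance thresholding described in this caption amounts to a quantile filter on per-gene variance across the reference samples. The following is a minimal sketch of that idea, assuming the reference is a mapping from gene name to a list of expression values; the function name, the nearest-rank quantile rule, and the inclusive cutoff are assumptions of this sketch rather than CIPR's documented behavior.

```python
from statistics import pvariance

def filter_by_variance(reference, quantile):
    """Keep genes whose variance across reference samples reaches a quantile.

    reference: dict mapping gene name -> list of expression values.
    quantile:  fraction in [0, 1]; 0.9 keeps roughly the top 10% of genes.
    The cutoff is picked by a simple nearest-rank rule (an assumption),
    and genes with variance equal to the cutoff are kept.
    """
    variances = {gene: pvariance(values) for gene, values in reference.items()}
    ordered = sorted(variances.values())
    index = min(int(quantile * len(ordered)), len(ordered) - 1)
    cutoff = ordered[index]
    return {gene: reference[gene] for gene, var in variances.items() if var >= cutoff}

# Hypothetical reference with four genes measured in four samples.
reference = {
    "gene_flat": [5.0, 5.0, 5.0, 5.0],    # variance 0
    "gene_low": [1.0, 2.0, 1.0, 2.0],     # variance 0.25
    "gene_mid": [0.0, 4.0, 0.0, 4.0],     # variance 4
    "gene_high": [0.0, 10.0, 0.0, 10.0],  # variance 25
}

top_quarter = filter_by_variance(reference, 0.75)
top_half = filter_by_variance(reference, 0.5)
unfiltered = filter_by_variance(reference, 0.0)
```

Identity scores would then be computed on the reduced gene set, so raising the quantile trades gene coverage for a focus on the most discriminating genes.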
Fig. 5 Irrelevant reference subsets can be excluded to tailor the CIPR pipeline to different user needs. The CIPR pipeline allows users to easily exclude the reference subsets that are of no interest for the study at hand. Limiting the analysis to only the relevant reference subsets can increase the readability of the graphical outputs and may better differentiate closely related single cell clusters. To demonstrate this capability, we subsetted the scRNAseq dataset described in Fig. 1 to contain only T cells (as defined by the simultaneous expression of Cd3e and Cd4 or Cd8a marker genes). We then performed CIPR analyses with or without limiting the pipeline to T cell references within the ImmGen dataset. a Uniform manifold approximation and projection (UMAP) plot with 6 distinct single-cell clusters shows the heterogeneity within the T cell subsets in the tumor microenvironment. b Representative feature plots indicate that the clusters are composed of Cd4+ helper and Cd8a+ cytotoxic T cells, some of which exhibited an activated phenotype (Ifng+ cells) while others appeared to have a naïve-memory phenotype (Sell+ cells). Of note, cluster 06 is composed of Foxp3+ regulatory T cells (Tregs). c CIPR analysis using the logFC dot product method shows that the highest-scoring reference subsets for cluster 06 are regulatory T cell subsets within the ImmGen reference data. d Graphs show that identity scores calculated by CIPR, SingleR, and scmap are positively correlated for both cluster 01 (activated Cd8a+ cells) and cluster 06 (Tregs). For these analyses, the entire ImmGen reference data (296 samples spanning 20 different cell types) were used, and the calculations were performed at the cluster level as described above. e The positive correlation between different analytical approaches was stronger when the reference dataset was limited to T cell subsets (70 samples in ImmGen data). In general, the highest-scoring reference cell subsets in CIPR also scored the highest in the scmap and SingleR methods.
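Restricting the reference to relevant subsets, as described in this caption, can be thought of as a label-based filter applied before any scoring. The sketch below assumes the reference is a mapping from sample name to expression values plus a separate mapping from sample name to main cell type; the function name, sample names, and values are hypothetical and do not reflect CIPR's own interface.

```python
def restrict_reference(expression, sample_types, keep_types):
    """Return only the reference samples whose main type is in keep_types.

    expression:   dict mapping sample name -> list of expression values.
    sample_types: dict mapping sample name -> main cell type label.
    keep_types:   iterable of cell type labels to retain.
    """
    keep = set(keep_types)
    return {
        sample: values
        for sample, values in expression.items()
        if sample_types[sample] in keep
    }

# Hypothetical reference with four samples and two genes each.
expression = {
    "ref_T_1": [2.0, 0.5],
    "ref_Treg_1": [1.5, 3.0],
    "ref_B_1": [0.1, 0.2],
    "ref_Mac_1": [0.0, 0.4],
}
sample_types = {
    "ref_T_1": "T cell",
    "ref_Treg_1": "Treg",
    "ref_B_1": "B cell",
    "ref_Mac_1": "Mac",
}

t_cell_reference = restrict_reference(expression, sample_types, ["T cell", "Treg"])
```

Filtering before scoring matters for any method that summarizes the reference as a whole, since quantities such as the mean and standard deviation of identity scores are then computed over the retained samples only.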