Rui Fu, Austin E Gillen, Ryan M Sheridan, Chengzhe Tian, Michelle Daya, Yue Hao, Jay R Hesselberth, Kent A Riemondy.
Abstract
Assignment of cell types from single-cell RNA sequencing (scRNA-seq) data remains a time-consuming and error-prone process. Current packages for identity assignment use limited types of reference data and often have rigid data structure requirements. We developed the clustifyr R package to leverage several external data types, including gene expression profiles, to assign likely cell types using data from scRNA-seq, bulk RNA-seq, microarray expression data, or signature gene lists. We benchmark various parameters of a correlation-based approach and implement gene list enrichment methods. clustifyr is a lightweight and effective cell-type assignment tool developed for compatibility with various scRNA-seq analysis workflows. clustifyr is publicly available at https://github.com/rnabioco/clustifyr.
Keywords: R package; Single-cell RNA sequencing; cell type classification; gene expression profile
Year: 2020 PMID: 32765839 PMCID: PMC7383722 DOI: 10.12688/f1000research.22969.2
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Figure 1. Schematic for clustifyr input and output.
Collection of datasets used for introducing and benchmarking clustifyr.
A description of single cell RNA-seq, bulk RNA-seq, and microarray datasets used in this study. The datasets available through ExperimentHub are references that were built from raw or downloaded data and can be used with clustifyr. R objects can be accessed using the direct download URLs to the .rda files, or through the clustifyrdatahub ExperimentHub.
| Description | # of | Organism | Publication | Source | Data Provider | R object download URL [1] | Bioconductor ExperimentHub ID [2] | R object name [3] |
|---|---|---|---|---|---|---|---|---|
| Mouse Cell | 713 | mouse | | | figshare | | EH3444 | ref_MCA |
| Tabula Muris | 112 | mouse | | | figshare | | EH3445 | ref_tabula_ |
| Tabula Muris | 175 | mouse | | | figshare | | EH3446 | ref_tabula_ |
| Mouse RNA-seq | 28 | mouse | | | GitHub | | EH3447 | ref_mouse. |
| Mouse | 37 | mouse | | | washington.edu | | EH3448 | ref_moca_ |
| Mouse sorted | 253 | mouse | | | GitHub | | EH3449 | ref_immgen |
| Human | 38 | human | | | GEO | | EH3450 | ref_hema_ |
| Human cortex | 47 | human | | | UCSC | | EH3451 | ref_cortex_ |
| Human | 14 | human | | | S3 | | EH3452 | ref_pan_ |
| Human | 12 | human | | | S3 | | EH3453 | ref_pan_ |
| Human PBMCs, | 9 | human | | | Zenodo | | NA | NA |
| Human PBMCs, | 5,7,10 | human | | | Zenodo | | NA | NA |
| Mouse anterior | 34 | mouse | | | Allen Brain | NA | NA | NA |
| Mouse brain | 34 | mouse | | | Allen Brain | NA | NA | NA |
| Human PBMC | 5 | human | | | Investigator | NA | NA | NA |
| Human CBMC | 13 | human | | | GEO | NA | NA | NA |
| Human PBMCs | 9 | human | | | 10x Genomics | | NA | NA |

[1] Download URL to access the R object (if available).
[2] R object ID in the clustifyrdatahub Bioconductor ExperimentHub.
[3] R object name (if available via clustifyrdatahub).
Figure 2. Parameter considerations for clustifyr.
A) Comparison of median F1-scores of different correlation methods for classifying across platforms using the PBMC-bench dataset. B) Heatmap showing correlation coefficients between query cell types and the reference cell types from a rejection test, in which megakaryocytes were excluded from the reference dataset. The Neg.Cell cluster is megakaryocytes, which is correctly not annotated as a different cell type when megakaryocytes are absent from the reference. By default, clusters with correlation < 0.50 are labeled “unassigned” by clustifyr. C) Comparison of correlation coefficients with and without feature selection when comparing average gene expression per cell type between two pancreas scRNA-seq datasets. The “unclassified” cell type was not defined in the Segerstolpe et al. dataset. D) Accuracy (defined as the ratio between the number of correctly classified clusters and the overall number of clusters) and performance were assessed with decreasing query cluster cell numbers, using the Tabula Muris as the query dataset and the Mouse Cell Atlas as the reference dataset. E) Example of overclustering the query data and assigning cell types for data exploration. UMAP of a PBMC dataset generated by 10x Genomics, with cell types assigned by comparison to reference data from CBMC cells from Stoeckius et al. 2017. F) The median F1-score when using single or multiple averaged profiles as reference cell types was assessed using the PBMC-bench test set. The number of reference expression profiles to generate for each cell type is determined by the number of cells in the cluster (n) and the sub-clustering power argument (x), with the formula $n^x$.
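The thresholding rule described for panel B can be illustrated with a minimal sketch. This is not clustifyr's implementation or API (clustifyr is an R package); it is a plain-Python illustration that assumes Spearman correlation as one possible correlation method and uses made-up five-gene expression profiles. The names `assign_cluster`, `spearman`, and the toy `reference` dictionary are hypothetical.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of ranks."""
    return pearson(rank(x), rank(y))


def assign_cluster(query_profile, reference, threshold=0.50):
    """Label a cluster with its best-correlated reference cell type.

    Clusters whose best correlation is below `threshold` are labeled
    "unassigned", mirroring the 0.50 default stated in the caption.
    """
    scores = {name: spearman(query_profile, prof)
              for name, prof in reference.items()}
    best = max(scores, key=scores.get)
    label = best if scores[best] >= threshold else "unassigned"
    return label, scores[best]


# Toy average-expression profiles over the same five genes (made-up values).
reference = {
    "B cell": [9, 1, 0, 2, 5],
    "T cell": [0, 8, 7, 1, 5],
}

print(assign_cluster([8, 2, 1, 3, 5], reference))
print(assign_cluster([2, 3, 1, 9, 0], reference))
```

In this toy setup the first query profile follows the same gene ranking as the "B cell" reference and is labeled accordingly, while the second correlates poorly with both references and falls below the threshold, which is the behavior the rejection test in panel B examines.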
Figure 3. clustifyr can utilize multiple reference data types.
UMAP projections of PBMCs showing the ground truth cell types (A), or cell types called by clustifyr using microarray data from sorted immune cell types (B), bulk RNA-seq from immune cell populations (C), or scRNA-seq data from CBMCs (D).
Figure 4. clustifyr accurately and rapidly annotates cell types.
A) Accuracy and run-time of classifications generated by clustifyr or existing methods using the Tabula Muris dataset to benchmark cell type classifications between datasets generated with the Smart-Seq2 or 10x Genomics sequencing platforms. Each point represents a different tissue comparison. clustifyr (m3drop) indicates clustifyr run using variable genes defined by M3Drop, clustifyr_lists (hyper) uses hypergeometric tests to compare marker gene lists, and clustifyr_lists (jaccard) calculates the Jaccard index between marker gene lists to annotate cell types. B) Performance comparison of clustifyr to existing methods with random subsamples of cells from the Smart-Seq2 Tabula Muris dataset. Error bars represent the standard error of the mean and are derived from 5 independent subsamples of the dataset. C) Performance comparison of clustifyr to existing methods, testing classification of an Allen Institute Brain Atlas dataset from two murine brain regions that contain 34 cell types. scPred is not shown because it failed with an error on this dataset. D) Comparison of clustifyr to existing methods for rejecting unseen populations using PBMC data. Three reference PBMC datasets were generated that excluded T-cells, CD4+ T-cells, or memory T-cells, respectively. The % rejected indicates the % of the indicated cell type that was not misclassified when that cell type was missing from the reference.
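The two gene-list comparisons named in panel A, the Jaccard index and the hypergeometric test, can be sketched in a few lines. The following is an illustrative Python sketch under stated assumptions, not the code behind clustifyr_lists: the function names, the background gene count, and the short marker lists are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import comb


def jaccard_index(a, b):
    """Size of the intersection divided by size of the union of two gene lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hypergeom_pvalue(query_markers, ref_markers, n_background):
    """Upper-tail hypergeometric p-value for the overlap of two marker lists.

    Probability of seeing at least the observed overlap when drawing
    len(query_markers) genes at random from n_background genes, of which
    len(ref_markers) are reference markers.
    """
    q, r = set(query_markers), set(ref_markers)
    k, K, n, N = len(q & r), len(r), len(q), n_background
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


def annotate_by_jaccard(query_markers, reference_lists):
    """Return the reference cell type whose marker list has the highest Jaccard index."""
    return max(reference_lists,
               key=lambda name: jaccard_index(query_markers, reference_lists[name]))


# Illustrative marker lists (toy examples, not taken from the paper).
reference_lists = {
    "T cell": ["CD3E", "CD3D", "CD8A"],
    "B cell": ["MS4A1", "CD79A", "CD79B"],
}
query = ["CD3E", "CD3D", "IL7R"]

print(annotate_by_jaccard(query, reference_lists))
print(jaccard_index(query, reference_lists["T cell"]))
print(hypergeom_pvalue(query, reference_lists["T cell"], n_background=10))
```

A smaller hypergeometric p-value or a larger Jaccard index both indicate stronger overlap between the query cluster's markers and a reference list; the background size of 10 genes here is deliberately tiny and would be the number of genes considered in a real analysis.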
| Dataset | Source |
|---|---|
| PBMC 3k Seurat V3 | |
| CBMC CITE-seq | Accession number GSE100866 |
| Hematopoiesis | Accession number GSE24759 |
| Tabula Muris as | |
| Mouse Cell Atlas | |
| Pancreatic | |
| Allen Institute Brain | |
| PBMC-bench | |
| PBMC rejection test | |
| ImmGen Database | |