Zijian Ni, Shuyang Chen, Jared Brown, Christina Kendziorski.
Abstract
An important challenge in pre-processing data from droplet-based single-cell RNA sequencing protocols is distinguishing barcodes associated with real cells from those binding background reads. Existing methods test barcodes individually and consequently do not leverage the strong cell-to-cell correlation present in most datasets. To improve cell detection, we introduce CB2, a cluster-based approach for distinguishing real cells from background barcodes. As demonstrated in simulated and case study datasets, CB2 has increased power for identifying real cells, which allows for the identification of novel subpopulations and improves the precision of downstream analyses.
Keywords: Cell detection; Droplet-based protocols; Single-cell RNA-seq
Year: 2020 PMID: 32513247 PMCID: PMC7278076 DOI: 10.1186/s13059-020-02054-8
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Overview of CB2. a Projection of a hypothetical cell population containing three subpopulations (red, green, and blue where intensity corresponds to read depth). CB2 takes as input a gene by barcode matrix of UMI counts and returns a gene by cell matrix. b High-count barcodes with counts above a pre-specified upper threshold are considered real cells; barcodes with counts below a lower threshold are used to estimate a background distribution (Additional file 1: Figure S2). The remaining barcodes are clustered, and tight clusters are tested as a group against the estimated background distribution; barcodes not in tight clusters are tested individually (not shown). High-count barcodes and those identified by CB2 are retained for downstream analysis
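The first stage described in Fig. 1b can be illustrated with a minimal Python sketch. It is not the CB2 implementation: the function names, the dictionary-based input, and the threshold values in the usage note are hypothetical, and estimating the background as pooled gene proportions is a simplifying assumption made only for illustration. The clustering and testing steps are not shown.

```python
from typing import Dict, List, Tuple


def partition_barcodes(
    counts: Dict[str, List[int]], lower: int, upper: int
) -> Tuple[List[str], List[str], List[str]]:
    """Split barcodes by total UMI count into three groups.

    counts maps each barcode to its per-gene UMI counts (hypothetical layout).
    Returns (high_count, background, candidates), where candidates are the
    remaining barcodes that would go on to clustering and testing.
    """
    high_count, background, candidates = [], [], []
    for barcode, gene_counts in counts.items():
        total = sum(gene_counts)
        if total > upper:
            high_count.append(barcode)   # treated as real cells, no test
        elif total < lower:
            background.append(barcode)   # used to estimate the background
        else:
            candidates.append(barcode)   # to be clustered and tested
    return high_count, background, candidates


def estimate_background(
    counts: Dict[str, List[int]], background: List[str]
) -> List[float]:
    """Pool counts over background barcodes and return gene proportions.

    Assumption: plain pooled proportions stand in for whatever estimator
    the actual method uses.
    """
    if not background:
        raise ValueError("no background barcodes to pool")
    n_genes = len(counts[background[0]])
    pooled = [0] * n_genes
    for barcode in background:
        for gene, count in enumerate(counts[barcode]):
            pooled[gene] += count
    total = sum(pooled)
    if total == 0:
        raise ValueError("background barcodes contain no counts")
    return [count / total for count in pooled]
```

For example, with toy counts for four barcodes and hypothetical thresholds `lower=3` and `upper=50`, a barcode with 100 total UMIs is kept as a high-count cell, barcodes with 2 total UMIs feed the background estimate, and a barcode with 10 total UMIs is left as a candidate for testing.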
Fig. 2 Results from the Alzheimer dataset. a t-SNE plot of cells identified by CB2 and ED. High-count barcodes exceeding an upper threshold are identified as real cells by both methods without a statistical test (dark pink); barcodes identified as cells by both methods following a statistical test are shown in pink. Cells identified uniquely by CB2 (yellow) and ED (black) are also shown. CB2 identifies an increased number of cells in existing subpopulations (Subpop1–Subpop4) and also identifies a novel subpopulation (Subpop5). b Distribution plots of the 100 genes having the highest average expression in Subpop1 are shown for cells identified by both CB2 and ED (upper) and identified uniquely by CB2 (middle). The estimated background distribution is also shown (lower). Cells uniquely identified by CB2 in Subpop1 have a distribution similar to other Subpop1 cells and differ from the background. c Heatmap of log-transformed raw UMI counts for the same 100 genes for barcodes identified by CB2 and ED (left) and barcodes uniquely identified by CB2 (right). d t-SNE plots of cells colored by neuron marker genes SYT1, SNAP25, and GRIN1 in all cells (upper) and those identified uniquely by CB2 (lower)
Fig. 3 Results from the PBMC8K dataset. a t-SNE plot of cells identified by CB2 and ED. High-count barcodes exceeding an upper threshold are identified as real cells by both methods without a statistical test (dark pink); barcodes identified as cells by both methods following a statistical test are shown in pink. Cells identified uniquely by CB2 (yellow) and ED (black) are also shown. CB2 increases the number of cells identified across the six subpopulations by over 80% (Additional file 2: Table S1). b Subpopulations 1–5 ordered by median normalized UMI count along with marker gene expression for each subpopulation. Marker gene expression in cells uniquely identified by CB2 is similar to that in other groups and differs from the background. Subpopulation 5 contained no high-count common cells; subpopulation 6 contained no unique CB2 identifications and is therefore not shown in panel b