Jun Xu, Caitlin Falconer, Quan Nguyen, Joanna Crawford, Brett D McKinnon, Sally Mortlock, Anne Senabouth, Stacey Andersen, Han Sheng Chiu, Longda Jiang, Nathan J Palpant, Jian Yang, Michael D Mueller, Alex W Hewitt, Alice Pébay, Grant W Montgomery, Joseph E Powell, Lachlan J M Coin.
Abstract
A variety of methods have been developed to demultiplex pooled samples in a single-cell RNA sequencing (scRNA-seq) experiment, but they require either hashtag barcodes or sample genotypes prior to pooling. We introduce scSplit, which uses genetic differences inferred from scRNA-seq data alone to demultiplex pooled samples. scSplit also enables mapping clusters to original samples. Using simulated, merged, and pooled multi-individual datasets, we show that scSplit predictions are highly concordant with demuxlet predictions and highly consistent with the known truth in a cell-hashing dataset. scSplit is ideally suited to samples without external genotype information and is available at https://github.com/jon-xu/scSplit.
Keywords: Allele fraction; Demultiplexing; Doublets; Expectation-maximization; Genotype-free; Hidden Markov Model; Machine learning; Unsupervised; scRNA-seq; scSplit
Year: 2019 PMID: 31856883 PMCID: PMC6921391 DOI: 10.1186/s13059-019-1852-7
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Overview of accuracy and performance of scSplit on simulated mixed samples, with one CPU and 30 GB RAM
| Simulation | sim2 | sim3 | sim4 | sim8 | sim12 | sim16 | sim24 | sim32 |
|---|---|---|---|---|---|---|---|---|
| Mixed samples | 2 | 3 | 4 | 8 | 12 | 16 | 24 | 32 |
| Number of cells | 12 383 | 12 383 | 12 383 | 12 383 | 12 383 | 12 383 | 12 383 | 12 383 |
| Reads per cell | 4 973 | 4 973 | 4 973 | 4 973 | 4 973 | 4 973 | 4 973 | 4 973 |
| Informative SNVs | 34 116 | 34 116 | 34 116 | 34 116 | 34 116 | 34 116 | 34 116 | 34 116 |
| Assigning cells | 41 min | 41 min | 46 min | 47 min | 1 h 54 min | 2 h 11 min | 2 h 33 min | 2 h 55 min |
| Singlet TPR | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.96 | 0.96 |
| Singlet FDR | 0 | 9E-5 | 9E-5 | 9E-5 | 0 | 0 | 5E-3 | 8E-3 |
| Doublet TPR | 0.997 | 0.997 | 0.997 | 0.997 | 0.997 | 0.997 | 0.995 | 0.997 |
| Doublet FDR | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Cohen’s Kappa | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.97 | 0.98 |
We used PBMC donor B [2] and genotype data from demuxlet [6] as simulation templates.
Fig. 1 Results on simulated, merged, and hash-tagged scRNA-seq datasets confirm that scSplit is a useful tool for demultiplexing pooled single cells. a Confusion matrix showing scSplit demultiplexing results on simulated 2-, 3-, 4- and 8-mix; b TPR and FDR for singlets and doublets predicted by scSplit and demuxlet compared to known truth before merging; c TPR and FDR for singlets and doublets predicted by scSplit and demuxlet compared to cell hashing tags
Comparison of scSplit and demuxlet performance in demultiplexing three merged, individually genotyped stromal samples (TPR true positive rate, FDR false discovery rate); total cell numbers: 9567; reads per cell: 14,495; informative SNVs: 63,129; runtime for matrices building: 67 min; runtime for cell assignment: 55 min
| Predictions vs Truth | | TPR | FDR | Cohen’s Kappa |
|---|---|---|---|---|
| scSplit | Singlet | 0.94 | 0.02 | 0.95 |
| | Doublet | 0.65 | 0.04 | |
| demuxlet | Singlet | 0.93 | 0.02 | 0.77 |
| | Doublet | 0.66 | 0.47 | |
Comparison of scSplit and demuxlet performance in demultiplexing eight hashtagged, multiplexed, individually genotyped PBMC samples (TPR true positive rate, FDR false discovery rate); total cell numbers: 7932; reads per cell: 5835; informative SNVs: 16,058; runtime for matrices building: 35 min; runtime for cell assignment: 20 min
| Predictions vs Truth | | TPR | FDR | Cohen’s Kappa |
|---|---|---|---|---|
| scSplit | Singlet | 0.98 | 0.13 | 0.75 |
| | Doublet | 0.35 | 0.28 | |
| demuxlet | Singlet | 0.79 | 0.10 | 0.74 |
| | Doublet | 0.65 | 0.46 | |
Fig. 2 Results of scSplit on pooled PBMC scRNA-seq data and on a set of pooled fibroblast samples. a Singlet TPR and FDR compared to demuxlet predictions on pooled PBMC scRNA-seq. b Violin plot of singlet TPR and FDR for five 7- or 8-mixed samples based on scSplit vs demuxlet
Comparison of scSplit and demuxlet performance in demultiplexing eight multiplexed, individually genotyped PBMC samples (TPR true positive rate, FDR false discovery rate); total cell numbers: 6145; reads per cell: 33,119; informative SNVs: 22,757; runtime for matrices building: 45 min; runtime for cell assignment: 35 min
| scSplit vs demuxlet | TPR | FDR | Cohen’s Kappa |
|---|---|---|---|
| Singlets | 0.80 | 0.18 | 0.63 |
| Doublets | 0.12 | 0.92 | |
Overview of accuracy and performance running scSplit on five multiplexed scRNA-seq datasets, with one CPU and 30 GB RAM
| scSplit vs demuxlet | Mix 1 | Mix 2 | Mix 3 | Mix 4 | Mix 5 |
|---|---|---|---|---|---|
| Mixed samples | 7 | 8 | 8 | 8 | 7 |
| Number of cells | 914 | 8 137 | 5 165 | 6 977 | 7 428 |
| Reads per cell | 86 148 | 16 386 | 21 265 | 18 572 | 19 657 |
| Informative SNVs | 15 848 | 26 830 | 26 162 | 23 224 | 41 993 |
| Build matrices | 10 min | 23 min | 18 min | 21 min | 35 min |
| Assign cells | 4 min | 47 min | 23 min | 45 min | 50 min |
| Singlet TPR | 0.94 | 0.93 | 0.94 | 0.93 | 0.93 |
| Singlet FDR | 0.06 | 0.07 | 0.06 | 0.07 | 0.07 |
| Doublet TPR | 0.52 | 0.17 | 0.15 | 0.17 | 0.08 |
| Doublet FDR | 0.48 | 0.83 | 0.85 | 0.83 | 0.92 |
| Cohen’s Kappa | 0.86 | 0.78 | 0.68 | 0.77 | 0.76 |
(TPR: True Positive Rate; FDR: False Discovery Rate)
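The three metrics reported in these tables can be illustrated with a short Python sketch. The label lists below are made up for illustration and are not data from the study. The per-class definitions (TPR = TP / (TP + FN), FDR = FP / (TP + FP)) and the unweighted Cohen's kappa are the standard ones, and it is an assumption that they match exactly how the authors scored singlets and doublets.

```python
from collections import Counter


def tpr_fdr(truth, pred, positive):
    """Per-class true positive rate and false discovery rate."""
    pairs = list(zip(truth, pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    fdr = fp / (tp + fp) if tp + fp else float("nan")
    return tpr, fdr


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two labelings of the same cells."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from the marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelings use one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels: two samples (S1, S2) plus doublets.
truth = ["S1", "S1", "S2", "S2", "doublet", "doublet"]
pred = ["S1", "S1", "S2", "doublet", "doublet", "doublet"]

doublet_tpr, doublet_fdr = tpr_fdr(truth, pred, "doublet")
kappa = cohens_kappa(truth, pred)
print(doublet_tpr, doublet_fdr, kappa)
```

In the scSplit vs demuxlet tables, the demuxlet calls take the place of `truth`, so a low doublet TPR there reflects disagreement between the two tools rather than error against a known ground truth.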
Fig. 3 Batch effects between sequencing runs were pronounced in individually sequenced samples compared with pooled scRNA-seq data. a UMAP for three individually sequenced samples. b UMAP for three individually sequenced and normalized samples. c UMAP for pooled sequencing of the same three individual samples, with samples marked based on demultiplexing results using scSplit. d UMAP for pooled sequencing of the same three individual samples, normalized by total sample reads
Fig. 4 The overall pipeline of the scSplit tool. a SNVs identified based on reads from all cells, which have similar or different genotypes. b Alternative and reference allele count matrices built from each read in the pooled-sequenced BAM at the identified informative SNVs. c Initial allele fraction model constructed from the initial cell seeds and their allele counts. d Expectation-maximization process to find the optimal allele fraction model, based on which the cells are assigned to clusters. e Presence/absence matrix of alternative alleles generated from the cell assignments. f Minimum set of distinguishing variants found, used to map clusters to samples
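Steps b to d of the pipeline can be made concrete with a toy expectation-maximization loop over per-cluster alternative-allele fractions. This is a minimal sketch under stated assumptions, not the scSplit implementation: the function names `em_demultiplex` and `simulate` are hypothetical, the start is a random soft assignment rather than the seeded initial model in panel c, mixing weights are uniform, every cell has reads at every SNV (real data are much sparser), and doublet detection is omitted.

```python
import math
import random


def em_demultiplex(alt, ref, n_clusters, n_iter=30, seed=0):
    """Toy EM; alt[v][c] and ref[v][c] are read counts at SNV v in cell c.

    Returns (labels, frac), where frac[v][k] is the fitted alternative-allele
    fraction of cluster k at SNV v.
    """
    rng = random.Random(seed)
    n_snvs, n_cells = len(alt), len(alt[0])
    # Random soft start (assumption; the pipeline seeds from initial cells).
    resp = []
    for _ in range(n_cells):
        w = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(w)
        resp.append([x / total for x in w])
    frac = []
    for _ in range(n_iter):
        # M-step: responsibility-weighted allele fraction with a pseudocount,
        # which keeps every fraction strictly between 0 and 1.
        frac = []
        for v in range(n_snvs):
            row = []
            for k in range(n_clusters):
                a = sum(resp[c][k] * alt[v][c] for c in range(n_cells))
                t = sum(resp[c][k] * (alt[v][c] + ref[v][c])
                        for c in range(n_cells))
                row.append((a + 1.0) / (t + 2.0))
            frac.append(row)
        # E-step: binomial log-likelihood of each cell under each cluster
        # (the binomial coefficient is constant across clusters and dropped).
        for c in range(n_cells):
            ll = [sum(alt[v][c] * math.log(frac[v][k])
                      + ref[v][c] * math.log(1.0 - frac[v][k])
                      for v in range(n_snvs))
                  for k in range(n_clusters)]
            top = max(ll)
            w = [math.exp(x - top) for x in ll]
            total = sum(w)
            resp[c] = [x / total for x in w]
    labels = [max(range(n_clusters), key=lambda k: r[k]) for r in resp]
    return labels, frac


def simulate(n_snvs=60, cells_per_sample=20, depth=2, seed=1):
    """Two hypothetical samples with opposite allele fractions at each SNV."""
    rng = random.Random(seed)
    profiles = [[0.9 if v % 2 == 0 else 0.1 for v in range(n_snvs)],
                [0.1 if v % 2 == 0 else 0.9 for v in range(n_snvs)]]
    truth = [s for s in range(2) for _ in range(cells_per_sample)]
    alt = [[sum(rng.random() < profiles[s][v] for _ in range(depth))
            for s in truth] for v in range(n_snvs)]
    ref = [[depth - a for a in row] for row in alt]
    return alt, ref, truth


if __name__ == "__main__":
    alt, ref, truth = simulate()
    labels, _ = em_demultiplex(alt, ref, n_clusters=2)
    agree = sum(l == t for l, t in zip(labels, truth)) / len(truth)
    # Cluster indices are arbitrary, so agreement is taken up to a label swap.
    print("agreement:", max(agree, 1.0 - agree))
```

Because the fitted clusters carry arbitrary indices, a separate step is still needed to tie each cluster to a named sample, which is the role of the presence/absence matrix and distinguishing variants in panels e and f.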