Literature DB >> 31856883

Genotype-free demultiplexing of pooled single-cell RNA-seq.

Jun Xu¹, Caitlin Falconer², Quan Nguyen², Joanna Crawford², Brett D McKinnon^2,3, Sally Mortlock², Anne Senabouth⁴, Stacey Andersen^1,2, Han Sheng Chiu², Longda Jiang², Nathan J Palpant^1,2, Jian Yang^2,5, Michael D Mueller³, Alex W Hewitt^6,7,8, Alice Pébay^9,6,7, Grant W Montgomery^1,2, Joseph E Powell^10,4, Lachlan J M Coin^{11,12,13,14,15}.

Abstract

A variety of methods have been developed to demultiplex pooled samples in a single cell RNA sequencing (scRNA-seq) experiment which either require hashtag barcodes or sample genotypes prior to pooling. We introduce scSplit which utilizes genetic differences inferred from scRNA-seq data alone to demultiplex pooled samples. scSplit also enables mapping clusters to original samples. Using simulated, merged, and pooled multi-individual datasets, we show that scSplit prediction is highly concordant with demuxlet predictions and is highly consistent with the known truth in cell-hashing dataset. scSplit is ideally suited to samples without external genotype information and is available at: https://github.com/jon-xu/scSplit.

Entities: Chemical Disease Species

Keywords: Allele fraction; Demultiplexing; Doublets; Expectation-maximization; Genotype-free; Hidden Markov Model; Machine learning; Unsupervised; scRNA-seq; scSplit

Mesh：

Year: 2019 PMID： 31856883 PMCID： PMC6921391 DOI： 10.1186/s13059-019-1852-7

Source DB: PubMed Journal: Genome Biol ISSN： 1474-7596 Impact factor: 13.583

Background

Using single-cell RNA sequencing (scRNA-seq) to cell biology at cellular level provides greater resolution than “bulk” level analyses, thus allowing more refined understanding of cellular heterogeneity. For example, it can be used to cluster cells into sub-populations based on their differential gene expression, so that different fates of cells during development can be discovered. Droplet-based scRNA-seq (for example Drop-Seq [1] or 10X Genomics Systems [2]) allows profiling large numbers of cells for sequencing by dispersing liquid droplets in a continuous oil phase [3]s in an automated microfluidics system, and as a result is currently the most popular approach to scRNA-seq despite a high cost per run. Methods that lower the per sample cost of running scRNA-seq are required in order to scale this approach up to a population scale. An effective method for lowering scRNA-seq cost is to pool samples prior to droplet-based barcoding with subsequent demultiplexing of sequence reads. Cell hashing [4] based on Cite-seq [5] is one such experimental approach to demultiplex pooled samples. This approach uses oligo-tagged antibodies to label cells prior to mixing, but use of these antibodies increases both the cost and sample preparation time per run. Moreover, it requires access to universal antibodies for organism of interest, thus limiting applicability at this stage to human and mouse. Alternatively, computational tools like demuxlet [6] have been developed to demultiplex cells from multiple individuals, although this requires additional genotyping information to assign individual cells back to their samples of origin. This limits the utility of demuxlet, as genotype data might not be available for different species; biological material may not be available to extract DNA; or the genetic differences between samples might be somatic in origin. Another issue for droplet-based scRNA-seq protocols is the presence of doublets, which occurs when two cells are encapsulated in same droplet and acquire the same barcode. The proportion of doublets increases with increasing number of cells barcoded in a run. It is imperative that these are flagged and removed prior to downstream analysis. Demuxlet [6] uses external genotype information to address this issue, and other tools have been developed to solve this issue based on expression data alone, including Scrublet [7] and Doubletfinder [8]. Here we introduce a simple, accurate, and efficient tool, mainly for droplet-based scRNA-seq, called "scSplit", which uses a hidden state model approach to demultiplex individual samples from mixed scRNA-seq data with high accuracy. Our approach does not require genotype information from the individual samples to demultiplex them, which also makes it suitable for applications where genotypes are unavailable or difficult to obtain. scSplit uses existing bioinformatics tools to identify putative variant sites from scRNA-seq data, then models the allelic counts to assign cells to clusters using an expectation-maximisation framework.

Results

Our new tool, scSplit for demultiplexing pooled samples from scRNA-seq data, only requires the FASTQ files obtained from single cell sequencing, together with a white-list of barcodes, while it does not require genotype data, nor a list of common variants if not available. Result data are available in https://github.com/jon-xu/scSplit_paper_data.

Simulation run showed high accuracy and efficiency of scSplit

We used a single scRNA-seq BAM file from Zheng et al. [2] as a template for simulation. Additionally, we took 32 samples from genoptype information used in figure 2 supplementary data of Kang et al. [6], as the source of multi-sample genotype likelihoods for simulation (see “Data simulation” in Methods). We ran simulation tests using our scSplit tool and used the distinguishing variants to identify the individual donor for each cluster. In order to assess the accuracy of the method, we calculated both the proportion of cells from each cluster which were correctly assigned to it among the true correct number of cells in each cluster (True Positive Rate or TPR), as well as the proportion of cells assigned to a cluster which were incorrect against the total assigned cells (false discovery rate or FDR). We also report the average TPR and average FDR. We obtained very high overall TPR (0.97) and low FDR (less than 1e −4) for from 2- to 32-mixed samples, with very accurate doublet predictions (Table 1, Fig. 1a). To test the limit of our tool on genotype difference, we downloaded three pairs of full sibling genotypes from the UK Biobank and simulated pooled samples by mixing one pair at a time, the average singlet TPR was beyond 0.87 (Table S2 in Additional file 2).

Table 1

Overview of accuracy and performance of scSplit on simulated mixed samples, with one CPU and 30GB RAM

Simulation	sim2	sim3	sim4	sim8	sim12	sim16	sim24	sim32
Mixed samples	2	3	4	8	12	16	24	32
Number of cells	12 383	12 383	12 383	12 383	12 383	12 383	12 383	12 383
Reads per cell	4 973	4 973	4 973	4 973	4 973	4 973	4 973	4 973
Informative SNVs	34 116	34 116	34 116	34 116	34 116	34 116	34 116	34 116
Assigning cells	41 min	41 min	46 min	47 min	1h54m	2h11	2h33	2h55
Singlet TPR	0.97	0.97	0.97	0.97	0.97	0.97	0.96	0.96
Singlet FDR	0	9E −5	9E −5	9E −5	0	0	5E −3	8E −3
Doublet TPR	0.997	0.997	0.997	0.997	0.997	0.997	0.995	0.997
Doublet FDR	0	0	0	0	0	0	0	0
Cohen’s Kappa	1.0	1.0	1.0	1.0	1.0	1.0	0.97	0.98

We used PBMC donor B [2] and genotype data from demuxlet [6] as simulation templates

Fig. 1

Results on simulated, merged hash-tagged scRNA-seq datasets confirmed scSplit a useful tool to demultiplex pooled single cells. a Confusion matrix showing scSplit demultiplexing results on simulated 2-, 3-, 4- and 8-mix; b TPR and FDR of for singlets and doublets predicted by scSplit and demuxlet compared to known truth before merging; c TPR and FDR of for singlets and doublets predicted by scSplit and demuxlet compared to cell hashing tags

scSplit performed similarly well to demuxlet in demultiplexing merged individually sequenced three stromal samples

We then tried running scSplit on a manual merging of three individually sequenced samples. We merged the BAM files from three individual samples (Methods). In order to create synthetic doublets, we randomly chose 500 barcodes whose reads were merged with another 500 barcodes. We ended up with 9067 singlets and 500 doublets, knowing their sample origins prior to merging. Both scSplit and demuxlet [6] pipelines were run on the merged samples, and the results were compared with the known individual sample data. We observed high concordance of singlet prediction between both tools (TPR/FDR: 0.94/0.02 vs 0.93/0.02), and a better doublet prediction from scSplit compared to demuxlet (TPR/FDR: 0.65/0.04 vs 0.66/0.47) (Fig. 1b and Table 2). We then downsampled the mixed sample to 2800 reads per cell in order to test the performance under low sequencing depth and the overall result was still good (TPR = 0.91, FDR = 0.03), which indicated that scSplit can work under shallow read depth.

Table 2

Comparison of scSplit and demuxlet performance in demultiplexing merged three individually genotyped stromal samples (TPR true positive rate, FDR false discovery rate); Total cell numbers: 9567; Reads per cell: 14,495; Informative SNVs: 63,129; Runtime for matrices building: 67 min, Runtime for cell assignment: 55 min

Predictions vs Truth		TPR	FDR	Cohen’s Kappa
scSplit	Singlet	0.94	0.02	0.95
	Doublet	0.65	0.04
demuxlet	Singlet	0.93	0.02	0.77
	Doublet	0.66	0.47

scSplit predictions highly consistent with known source of hashtagged and pooled eight PBMC samples

Next, we tested scSplit on a published scRNA-seq dataset (GSE108313) which used cell-hashing technology to mark samples of the cells before multiplexing [4]. We ran through the scSplit pipeline with the SNVs filtered by common SNVs provided on The International Genome Sample Resource (IGSR) [9]. According to the scSplit pipeline, distinguishing variants were identified, and the P/A matrix was generated to assign the cells to clusters (Methods). We then extracted the reference and alternative allele absence information at these distinguishing variants from the sample genotypes and generated a similar P/A matrix. Both matrices were compared so that clusters were mapped to samples (Figure S1 in Additional file 1). Our results were highly consistent with the known cell hashing tags (Table 3). We saw higher TPR for singlets in scSplit (0.98) than demuxlet (0.79) and similar singlet FDRs (0.10 vs 0.13). Although the doublet TPR of scSplit (0.35) was lower than for demuxlet (0.65), the doublet FDR (0.28) was better than demuxlet (0.46). If the expected number of doublets was selected higher, cells with largest read depth could be moved from singlet clusters to the doublet cluster to increase the TPR for doublets with a decrease of TPR for singlets.

Table 3

Predictions vs Truth		TPR	FDR	Cohen’s Kappa
scSplit	Singlet	0.98	0.13	0.75
	Doublet	0.35	0.28
demuxlet	Singlet	0.79	0.10	0.74
	Doublet	0.65	0.46

Comparison of scSplit and demuxlet performance in demultiplexing hashtagged and multiplexed eight individually genotyped PBMC samples (TPR true positive rate, FDR false discovery rate); total cell numbers: 7932; reads per cell: 5835; informative SNVs: 16,058; runtime for matrices building: 35 min, runtime for cell assignment: 20 min We also compared the performance of overall P/A genotyping matrices generated based on scSplit and demuxlet predictions against that from the known genotypes (“Alternative allele presence/absence genotyping for clusters” section of Methods). The results show that genotypes inferred from both scSplit and demuxlet predictions have good concordance with sample genotypes (Table S1 in Additional file 2).

Comparing scSplit with demuxlet on more pooled scRNA-seq samples

We ran scSplit with common SNV filtering on published data from the demuxlet paper [6]. By taking demuxlet predictions as ground truth, we achieved high singlet TPR (0.80), although the doublet prediction of the two tools were quite distinct to each other (Fig. 2a and Table 4).

Fig. 2

Table 4

Comparison of scSplit and demuxlet performance in demultiplexing multiplexed eight individually genotyped PBMC samples (TPR true positive rate, FDR false discovery rate); total cell numbers: 6145; reads per cell: 33,119; informative SNVs: 22,757; runtime for matrices building: 45 min; runtime for cell assignment: 35 min

scSplit vs demuxlet	TPR	FDR	Cohen’s Kappa
Singlets	0.80	0.18	0.63
Doublets	0.12	0.92

Results of scSplit on pooled PBMC scRNA-seq and that on a set of pooled fibroblast samples. a Singlet TPR and FDR compared to demuxlet predictions on pooled PBMC scRNA-seq. b Violin plot of singlet TPR and FDR for five 7- or 8-mixed samples based on scSplit vs demuxlet Comparison of scSplit and demuxlet performance in demultiplexing multiplexed eight individually genotyped PBMC samples (TPR true positive rate, FDR false discovery rate); total cell numbers: 6145; reads per cell: 33,119; informative SNVs: 22,757; runtime for matrices building: 45 min; runtime for cell assignment: 35 min We also ran our tool on a set of genotyped and then pooled fibroblast scRNA-seq datasets. Predictions from scSplit and demuxlet showed high concordance in singlet prediction (TPR: 0.93–0.94, FDR: 0.06–0.07), although not on doublets (TPR: 0.08–0.52, FDR: 0.45–0.92) when demuxlet was treated as gold standard (Fig. 2b and Table 5). Mapping between clusters and samples were recorded (Figure S2 in Additional file 1).

Table 5

Overview of accuracy and performance running scSplit on five multiplexed scRNA-seq datasets, with one CPU and 30 GB RAM

scSplit vs demuxlet	Mix 1	Mix 2	Mix 3	Mix 4	Mix 5
Mixed samples	7	8	8	8	7
Number of cells	914	8 137	5 165	6 977	7 428
Reads per cell	86 148	16 386	21 265	18 572	19 657
informative SNVs	15 848	26 830	26 162	23 224	41 993
Build matrices	10 min	23 min	18 min	21 min	35 min
Assign cells	4 min	47 min	23 min	45 min	50 min
Singlet TPR	0.94	0.93	0.94	0.93	0.93
Singlet FDR	0.06	0.07	0.06	0.07	0.07
Doublet TPR	0.52	0.17	0.15	0.17	0.08
Doublet FDR	0.48	0.83	0.85	0.83	0.92
Cohen’s Kappa	0.86	0.78	0.68	0.77	0.76

(TPR: True Positive Rate; FDR: False Discovery Rate)

Overview of accuracy and performance running scSplit on five multiplexed scRNA-seq datasets, with one CPU and 30 GB RAM (TPR: True Positive Rate; FDR: False Discovery Rate)

Pooling samples together showed similar effects as normalizing individually sequenced samples

We further checked the gene expression profiles of the previously illustrated three individual stromal samples (Fig. 1b and Table 2). We plotted Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) [10] for non-pooled and pooled scenarios with and without normalization. The samples were more separated from each other in non-pooled and non-normalized scenario (Fig. 3a), and got less distant for other scenarios including non-pooled but normalized (Fig. 3b), pooled and non-normalized (Fig. 3c), and pooled and normalized (Fig. 3d). We calculated Silhouette values for each of the UMAPs and got 0.28 for Fig. 3a, 0.12 for Fig. 3b, 0.14 for Fig. 3c, and 0.19 for Fig. 3d. As bigger Silhouette values indicate larger difference between samples, we could say both normalization and pooling could reduce the batch effects between individually sequenced samples. However, by pooling samples together for sequencing could minimize the potential information loss during normalization.

Fig. 3

Batch effect during sequencing runs found in comparison of individual runs was obvious compared to that in pooled scRNA-seq data. a UMAP for three individually sequenced samples. b UMAP for three individually sequenced and normalized samples. c UMAP for pooled sequencing of same three individual samples, samples marked based on demultiplexing results using scSplit. d UMAP for pooled sequencing of same three individual samples, normalized by total sample reads

Discussion

We developed the scSplit toolset to facilitate accurate, cheap, and fast demultiplexing of mixed scRNA samples, without needing sample genotypes prior to mixing. scSplit also generates a minimum set of alleles (as few as the sample numbers), enabling researchers to link the resulting clusters with the actual samples by comparing the allele presence at these distinguishing loci. When predefined individual genotypes are not available as a reference, this can be achieved by designing a simple assay focused on these distinguishing variants (such as a Massarray or multiplexed PCR assay). Although the tool was mainly designed for droplet-based scRNA-seq, it can also be used for scRNA-seq data generated from other types of scRNA-seq protocols. We filtered out indels, MNPs, and complex rearrangements when building the model and were able to show that SNVs alone provide adequate information to delineate the differences between multiple samples. As an alternative to using allele fractions to model multiple samples, genotype likelihoods could also be used for the same purpose; however, more memory and running time would likely be needed, especially when barcode numbers in mixed sample experiments increase. Our tests showed no discernable difference in accuracy between these two methods. The current version of scSplit assumes that the number of mixed samples is known. It is possible to run the scSplit tool for different sample numbers and compare the model log-likelihoods to select the most likely number of samples being modeled, but this would require significant computational resources and time. Alternatively, the reference and alternative allele counts in different samples and the size of the doublet cluster could be used to determine the sample number. Further optimization of the tool would be needed to effectively implement these options. Although scSplit was mainly tested on human samples, it can also be applied to other organisms and is especially useful for those species without dense genotyping chips available. We also expect the application of scSplit in cancer -related studies, to distinguish tumor cells from healthy cells, as well as to distinguish tumor sub-clones.

Conclusions

scSplit is an accurate, fast, and computationally efficient method with which to conduct demultiplexing of individual cells from pooled samples of scRNA-seq. In the next version, we plan to enable auto-detection of the mixed sample number, which will help to broaden the application of our tool to more biological and medical research areas, including but not limited to, distinguishing mixed infections, delineating tumor sub-clones and sequence analysis in non-model organisms.

Methods

All relevant source code is available at https://github.com/jon-xu/scSplit/.

Overview

The overall pipeline for the scSplit tool includes seven major steps (Fig. 4):

Fig. 4

The overall pipeline of scSplit tool. a SNV identified based on reads from all cells which have similar or different genotypes. b Alternative and reference allele count matrices built from each read in the pooled-sequenced BAM at the identified informative SNVs. c Initial allele fraction model constructed from the initial cell seeds and their allele counts. d Expectation-maximization process to find the most optimized allele fraction model, based on which the cells are assigned to clusters. e Presence/Absence matrix of alternative alleles generated from the cell assignments. f Minimum set of distinguishing variants found to be used to map clusters with samples

Data quality control and filtering: The mixed sample BAM file is first filtered to keep only the reads with a list of valid barcodes to reduce technical noise. Additional filtering is then performed to remove reads that meet any of the following: mapping quality score less than 10, unmapped, failing quality checks, secondary or supplementary alignment, or PCR or optical duplicate. The BAM file is then marked for duplication, sorted and indexed. The overall pipeline of scSplit tool. a SNV identified based on reads from all cells which have similar or different genotypes. b Alternative and reference allele count matrices built from each read in the pooled-sequenced BAM at the identified informative SNVs. c Initial allele fraction model constructed from the initial cell seeds and their allele counts. d Expectation-maximization process to find the most optimized allele fraction model, based on which the cells are assigned to clusters. e Presence/Absence matrix of alternative alleles generated from the cell assignments. f Minimum set of distinguishing variants found to be used to map clusters with samples SNV calling (Fig. 4a): Freebayes v1.2 [11] is used to call SNVs on the filtered BAM file, set to ignore insertions and deletions (indels), multi-nucleotide polymorphisms (MNPs), and complex events. A minimum base quality score of one and minimum allele count of two is required to call a variant. The output VCF file is further filtered to keep only SNVs with quality scores greater than 30. Building allele count matrices (Fig. 4b): The “matrices.py” script is run which produces two.csv files, one for each of reference and alternate allele counts as output. Model initialization (Fig. 4c): find the distinct groups of cells in the scRNA-seq and use them to initialize the Allele Fraction Model (SNVs by samples). E-M iterations till convergence (Fig. 4d): Initialized allele fraction model and the two allele count matrices are used together to calculate the probability of each cell belonging to the clusters. After each round, allele fraction model is updated based on the probability of cell assignment and this is iterated until overall likelihood of the model reaches convergence. Alternative presence/absence genotypes (Fig. 4e): matrix indicating cluster genotypes at each SNV is built in this step. Find distinguishing variants for clusters and use to assign samples to clusters (Fig. 4f): In order to assign each model cluster back to the specific sample, distinguishing variants are identified so that genotyping of the least number of loci using the a suitable platform may be performed. Gram-Schmidt orthogonalization [12] is used to get the minimum set of informative P/A genotypes.

Data quality control

Samtools was used to filter the reads with verified barcodes for mapping and alignment status, mapping quality, and duplication (samtools view -S -bh -q 10 -F 3844 [input] >[output]). Duplicates were removed (samtools rmdup [input] [output]) followed by sorting and indexing.

SNV calling on scRNA-seq dataset

SNVs were called on the scRNA-seq mixed sample BAM file with freebayes [11], a widely used variant calling tool. The freebayes arguments “-iXu -q 1” were set to ignore indels and MNPs and exclude alleles with supporting base quality scores of less than one. This generated a VCF file containing all SNVs from the mixed sample BAM file. Common SNPs of a population (for example results from The international Genome Sample Resource [9]) were recommended be used to filter out noisy SNVs.

Building allele count matrices

Allele count matrices were then built from (1) the provided mixed sample BAM file and (2) the VCF file obtained from the SNV calling program. Two allele count matrices were generated, one for the reference alleles and one for the alternate alleles, each with SNVs in rows and barcodes in columns. Each data element in the matrix indicated either the number of reference or alternate alleles detected in one cell barcode at that specific SNV position. This provided a full map of the distribution of reference and alternate alleles across all barcodes at each SNV. The allele count matrices captured information from all reads overlapping SNVs to reflect the different allele fraction patterns from different barcodes or samples. To build the allele count matrices, pysam fetch [13] was used to extract reads from the BAM file. The reads overlapping each SNV position were fetched and counted for the presence of the reference or alternate allele. In order to increase overall accuracy and efficiency, SNVs whose GL(RA)(likelihood of heterozygous genotypes) was lower than log10(1−error) where error = 0.01 were filtered out. These were more homozygous and thus less informative for detecting the differences between the multiple samples. The generated matrices were exported

Model initialization by using maximally informative cluster representatives

To initialize the model, initial probabilities of observing an alternative allele on each SNV position in each cluster were calculated. The overall matrix was sparse and a dense sub-matrix with a small number of zero count cells was generated. To do that, cells were first sorted according to their number of zero allele counts (sum of reference and alternative alleles) at all SNVs and SNVs were similarly sorted according to their number of zero allele counts (sum of reference and alternative alleles) across all cells. Next, we selected and filtered out 10% of the cells among those with the most number of zero expressed SNVs and 10% of the SNVs among those where the most number cells had zero counts. This was repeated until all remaining cells had more than 90% of their SNVs with non-zero allele counts and all SNVs had non-zero counts in more than 90% of cells. This subset of matrices was the basis for the seed barcodes to initialize the whole model. The sub-matrix was transformed using PCA with reduced dimensions and then K-means clustering was performed to split the cell subset into expected number of clusters. By using the allele fractions on the subset of SNVs in these initially assigned cells, each cluster of the model could be initialized. Let N(A) and N(R) be the Alternative and Reference allele counts on SNV v and cell c accordingly, and let pseudo R be the pseudo allele count for both Alternative and Reference alleles, and pseudo be the pseudo allele count for Alternative alleles, we calculated P(A|S), the probability of observing Alternative allele on SNV v in Sample n, according to below equation: We also initiated the probability of seeing the nth sample as evenly distributed across all samples. Let P(S) be the probability of seeing the n-th sample, and N(S) be the number of samples to be demultiplexed:

Expectation–maximization approach

The expectation-maximization (EM) algorithm [14] was used to conduct iterations using the full allele count matrices (Fig. 4). Each iteration consisted of an E-step to calculated the probability of seeing cells in all clusters, based on the allele fraction model, and an M-step to use the new probability of seeing cells in all clusters to update the allele fraction model. EM iterations stopped when convergence was reached, so that the overall probability of observing the cells, or the reference/alternative alleles count matrices, was maximized. During the E-step, the tool first calculated P(C|S), the likelihood of observing a cell C in sample S, which was equal to the product of the probability of observing the allele fraction pattern over each SNV, which in turn equaled to the product of probability of having observed the count of alternative alleles and probability of having observed the count of reference alleles. Let c be the i-th cell, S be the n-th sample, A be the Alternative allele on SNV v, and N(A), N(R) be the quantity of Alternative and Reference alleles: And then P(C|S) was transformed to P(S|C), the cell-sample probability, i.e. the probability of a cell C belonging to sample S, using Bayes’ theorem, assuming equal sample prior probabilities (P(S1) = P(S2) =... = P(S)): Next, weighted allele counts were distributed to the different cluster models according to the cell-sample probability, followed by the M-step, where the allele fraction model represented by the alternative allele fractions was updated using the newly distributed allele counts, so that allele fractions at all SNV positions in each sample model were recalculated: And the sample probability P(S) was also updated by the newly calculated cell likelihoods: The overall log-likelihood of the whole model [15] was calculated as:

Multiple runs to avoid local maximum likelihood

The entire process was repeated for 30 rounds with the addition of randomness during model initialization and the round with the largest sum of log likelihood was taken as the final result. Randomness was introduced by randomly selecting the 10% of cells and SNVs to be removed from the matrices during initialization from a range of the lowest ranked cells and SNVs as detailed previously.

Cell cluster assignment

Next, probability of a cell belonging to a cluster P(S|C) was calculated. Cells were assigned to a cluster based on a minimum threshold of P > 0.99. Those cells with no P(S|C) larger than the threshold were regarded as unassigned.

Handling of doublets

During scRNA-seq experiments, a small proportion of droplets can contain cells from more than one sample. These so called doublets, contain cells from same or different samples sharing the same barcode, which if not addressed would cause bias. Our model took these doublets into consideration. During our hidden state based demultiplexing approach, we included an additional cluster so that doublets could be captured. To identify which cluster in the model was the doublet cluster in each round, the sum of log-likelihood of cross assignments was checked: The sum log-likelihood of cells from all other clusters being assigned to a specific cluster was calculated for each cluster in turn and compared. The cluster with the largest sum log-likelihood of cross assignment was designated as the doublet cluster. We allow user input on the expected proportion of doublets. If the expected number of doublets was larger than those detected in the doublet cluster, cells with largest read depth were moved from singlet clusters to doublet cluster, so that the total number of doublets meet expectation as input.

Alternative allele presence/absence genotyping for clusters

To identify a minimum set of variants, which can distinguish between sample clusters, we generated alternative allele P/A genotype matrix (SNVs by clusters). To do that, sum of reference and alternate allele counts across all cells assigned to each cluster were calculated. And for each SNV and each cluster, “P” was marked if there were more than 10 alternative allele counts, and “A” for more than 10 reference allele counts but no alternative allele count. "NA" was set if neither criterion was met.

Mapping clusters back to individual samples using minimal set of P/A genotypes

Based on the P/A matrix, we started from informative SNVs which had variations of “P” or “A” across clusters and avoid picking those with “NAs”. Then, unique patterns involved in those SNVs were derived and for each unique P/A pattern, one allele was selected to subset the whole matrix. Next, Gram-Schmidt orthogonalization [12] was applied on the subset of P/A matrix, in order to find the variants which can be basis vectors to effectively distinguish the clusters. If not enough SNVs were found to distinguish all the clusters, the clusters were split into smaller groups so that for each group there was enough variants to distinguish the clusters within that group. And to distinguish clusters from different smaller groups, if the selected variants could not be used to distinguish any pair of clusters, additional variants were selected from the whole list of variants where no NAs were involved and P/A was different between the pair of clusters. Ideal situation was N variants for N clusters, but it was possible that >N variants were needed to distinguish N clusters. As such, the P/A genotyping of each cluster, on the minimum set of distinguishing variants, could be used as a reference to map samples to clusters. After running genotyping on this minimum set of loci for each of the individual samples, a similar matrix based on sample genotypes could be generated, by setting the alternative presence flag when genotype probability (GP) was larger than 0.9 for RA or AA, or absence flag when GP was larger than 0.9 for RR. By comparing both P/A matrices, we could link the identified clusters in scSplit results to the actual individual samples. In practice, samples can be genotyped only on the few distinguishing variants, so that scSplit-predicted clusters can be mapped with individual samples, while the whole genotyping is not needed. When the whole genotyping is available, we also provide an option for users to generate distinguishing variants only from variants with R2 >0.9, so that they can compare the distinguishing matrix from scSplit with that from known genotypes on more confident variants.

Data simulation

To test the consistency of the model, and the performance of our demultiplexing tool, reference/alternative count matrices were simulated from a randomly selected scRNA-seq BAM file from Zheng et al. [2] and a 32-sample VCF file used in Fig. 2 supplementary data of Kang et al. [6]. We assume the randomly selected BAM file had a representative gene expression profile. First, data quality was checked and the BAM and VCF files were filtered. Second, barcodes contained in the BAM file were randomly assigned to samples in the VCF file, which gave us the gold-standard of cell-sample assignments to check against after demultiplexing. Then, all the reads in the BAM file were processed, that if a read overlapped with any SNV position contained in the merged VCF file, its barcode was checked to get its assigned sample and the probability P(A) of having the alternative allele for that sample was calculated using the logarithm-transformed genotype likelihood (GL) or genotype probability (GP) contained in the VCF file. The probability of an allele being present at that position could then be derived so that the ALT/REF count at the SNV/barcode in the matrices could be simulated based on the alternative allele probability. Let and be the likelihood of seeing AA and RA of a certain cell c on a certain SNV v: Finally, doublets were simulated by merging randomly chosen 3% barcodes with another 3% without overlapping in the matrix. This was repeated for every single read in the BAM file. This simulation modeled the number of reads mapped to the reference and alternative alleles directly. In our simulations, there were 61 576 853 reads in the template BAM file for 12 383 cells, which was equivalent to 4973 rpc. With the simulated allele fraction matrices, the barcodes were demultiplexed using scSplit and the results were compared with the original random barcode sample assignments to validate.

Result evaluation

We used both TPR/FDR and Cohen’s Kappa [16] to evaluate the demultiplexing results against ground truth. R package “cluster” [17] was used in evaluating the clusters on UMAPs in Fig. 3.

Single cell RNA-seq data used in testing scSplit

In Tables 3 and 4, we used published hashtagged data from GSE108313 and PBMC data from GSE96583. For Tables 2 and 5, endometrial stromal cells cultured from 3 women and fibroblast cells cultured from 38 healthy donors over the age of 18 years respectively were run through the 10x Genomics Chromium 3’ scRNA-seq protocol. The libraries were sequenced on the Illumina Nextseq 500. FASTQ files were generated and aligned to Homo sapiens GRCh38p10 using Cell Ranger. Individuals were genotyped prior to pooling using the Infinium PsychArray.

Full sibling data from UK biobank used in simulation

In Table S2 in Additional file 2, we used genotype data of three pairs of full siblings from UK Biobank, which contained 564 981 SNVs, from which we used 258 077 SNVs within gene ranges, provided on the resource website of plink [18]: https://www.cog-genomics.org/plink/1.9/resources. Additional file 1 Figure S1. Illustration of presence absence matrices calculated on pooled and hashtagged scRNA-seq datasets. Figure S2. Illustration of presence absence matrices calculated on pooled fibroblast scRNA-seq datasets. Additional file 2 Table S1. Accuracy of alternative allele Presence/Absence genotypes built from scSplit/demuxlet clusters compared with that from sample genotyping, based on Hashtag scRNA-seq dataset. Table S2. Simulation using full sibling genotypes from UK Biobank shows scSplit can work for very closely related pooled samples. Additional file 3 Review history.

13 in total

1. PLINK: a tool set for whole-genome association and population-based linkage analyses.

Authors: Shaun Purcell; Benjamin Neale; Kathe Todd-Brown; Lori Thomas; Manuel A R Ferreira; David Bender; Julian Maller; Pamela Sklar; Paul I W de Bakker; Mark J Daly; Pak C Sham
Journal: Am J Hum Genet Date: 2007-07-25 Impact factor: 11.025

2. What is the expectation maximization algorithm?

Authors: Chuong B Do; Serafim Batzoglou
Journal: Nat Biotechnol Date: 2008-08 Impact factor: 54.908

3. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets.

Authors: Evan Z Macosko; Anindita Basu; Rahul Satija; James Nemesh; Karthik Shekhar; Melissa Goldman; Itay Tirosh; Allison R Bialas; Nolan Kamitaki; Emily M Martersteck; John J Trombetta; David A Weitz; Joshua R Sanes; Alex K Shalek; Aviv Regev; Steven A McCarroll
Journal: Cell Date: 2015-05-21 Impact factor: 41.582

4. Dimensionality reduction for visualizing single-cell data using UMAP.

Authors: Etienne Becht; Leland McInnes; John Healy; Charles-Antoine Dutertre; Immanuel W H Kwok; Lai Guan Ng; Florent Ginhoux; Evan W Newell
Journal: Nat Biotechnol Date: 2018-12-03 Impact factor: 54.908

5. Comparative Analysis of Droplet-Based Ultra-High-Throughput Single-Cell RNA-Seq Systems.

Authors: Xiannian Zhang; Tianqi Li; Feng Liu; Yaqi Chen; Jiacheng Yao; Zeyao Li; Yanyi Huang; Jianbin Wang
Journal: Mol Cell Date: 2018-11-21 Impact factor: 17.970

6. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data.

Authors: Samuel L Wolock; Romain Lopez; Allon M Klein
Journal: Cell Syst Date: 2019-04-03 Impact factor: 10.304

7. DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors.

Authors: Christopher S McGinnis; Lyndsay M Murrow; Zev J Gartner
Journal: Cell Syst Date: 2019-04-03 Impact factor: 10.304

8. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation.

Authors: Hyun Min Kang; Meena Subramaniam; Sasha Targ; Michelle Nguyen; Lenka Maliskova; Elizabeth McCarthy; Eunice Wan; Simon Wong; Lauren Byrnes; Cristina M Lanata; Rachel E Gate; Sara Mostafavi; Alexander Marson; Noah Zaitlen; Lindsey A Criswell; Chun Jimmie Ye
Journal: Nat Biotechnol Date: 2017-12-11 Impact factor: 54.908

9. Simultaneous epitope and transcriptome measurement in single cells.

Authors: Marlon Stoeckius; Christoph Hafemeister; William Stephenson; Brian Houck-Loomis; Pratip K Chattopadhyay; Harold Swerdlow; Rahul Satija; Peter Smibert
Journal: Nat Methods Date: 2017-07-31 Impact factor: 28.547

10. Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics.

Authors: Marlon Stoeckius; Shiwei Zheng; Brian Houck-Loomis; Stephanie Hao; Bertrand Z Yeung; William M Mauck; Peter Smibert; Rahul Satija
Journal: Genome Biol Date: 2018-12-19 Impact factor: 13.583

15 in total

Review 1. Practical Considerations for Single-Cell Genomics.

Authors: Claire Regan; Jonathan Preall
Journal: Curr Protoc Date: 2022-08

Review 2. Sample-multiplexing approaches for single-cell sequencing.

Authors: Yulong Zhang; Siwen Xu; Zebin Wen; Jinyu Gao; Shuang Li; Sherman M Weissman; Xinghua Pan
Journal: Cell Mol Life Sci Date: 2022-08-05 Impact factor: 9.207

3. propeller: testing for differences in cell type proportions in single cell data.

Authors: Belinda Phipson; Choon Boon Sim; Enzo R Porrello; Alex W Hewitt; Joseph Powell; Alicia Oshlack
Journal: Bioinformatics Date: 2022-10-14 Impact factor: 6.931

4. Single-nucleus cross-tissue molecular reference maps toward understanding disease gene function.

Authors: Gökcen Eraslan; Eugene Drokhlyansky; Shankara Anand; Evgenij Fiskin; Ayshwarya Subramanian; Michal Slyper; Jiali Wang; Ayellet V Segrè; François Aguet; Orit Rozenblatt-Rosen; Kristin G Ardlie; Aviv Regev; Nicholas Van Wittenberghe; John M Rouhana; Julia Waldman; Orr Ashenberg; Monkol Lek; Danielle Dionne; Thet Su Win; Michael S Cuoco; Olena Kuksenko; Alexander M Tsankov; Philip A Branton; Jamie L Marshall; Anna Greka; Gad Getz
Journal: Science Date: 2022-05-13 Impact factor: 63.714

Review 5. Guidelines for bioinformatics of single-cell sequencing data analysis in Alzheimer's disease: review, recommendation, implementation and application.

Authors: Minghui Wang; Won-Min Song; Chen Ming; Qian Wang; Xianxiao Zhou; Peng Xu; Azra Krek; Yonejung Yoon; Lap Ho; Miranda E Orr; Guo-Cheng Yuan; Bin Zhang
Journal: Mol Neurodegener Date: 2022-03-02 Impact factor: 18.879

6. Estimating the Allele-Specific Expression of SNVs From 10× Genomics Single-Cell RNA-Sequencing Data.

Authors: Prashant N M; Hongyu Liu; Pavlos Bousounis; Liam Spurr; Nawaf Alomran; Helen Ibeawuchi; Justin Sein; Dacian Reece-Stremtan; Anelia Horvath
Journal: Genes (Basel) Date: 2020-02-25 Impact factor: 4.096

7. No detectable alloreactive transcriptional responses under standard sample preparation conditions during donor-multiplexed single-cell RNA sequencing of peripheral blood mononuclear cells.

Authors: Nadia R Roan; Sulggi A Lee; Christopher S McGinnis; David A Siegel; Guorui Xie; George Hartoularos; Mars Stone; Chun J Ye; Zev J Gartner
Journal: BMC Biol Date: 2021-01-20 Impact factor: 7.364

8. Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action.

Authors: James M McFarland; Brenton R Paolella; Allison Warren; Kathryn Geiger-Schuller; Tsukasa Shibue; Michael Rothberg; Olena Kuksenko; William N Colgan; Andrew Jones; Emily Chambers; Danielle Dionne; Samantha Bender; Brian M Wolpin; Mahmoud Ghandi; Itay Tirosh; Orit Rozenblatt-Rosen; Jennifer A Roth; Todd R Golub; Aviv Regev; Andrew J Aguirre; Francisca Vazquez; Aviad Tsherniak
Journal: Nat Commun Date: 2020-08-27 Impact factor: 14.919

9. The single-cell eQTLGen consortium.

Authors: Mgp van der Wijst; D H de Vries; H E Groot; G Trynka; C C Hon; M J Bonder; O Stegle; M C Nawijn; Y Idaghdour; P van der Harst; C J Ye; J Powell; F J Theis; A Mahfouz; M Heinig; L Franke
Journal: Elife Date: 2020-03-09 Impact factor: 8.140

Review 10. Needle in a Haystack: The Naïve Repertoire as a Source of T Cell Receptors for Adoptive Therapy with Engineered T Cells.

Authors: Elvira D'Ippolito; Karolin I Wagner; Dirk H Busch
Journal: Int J Mol Sci Date: 2020-11-06 Impact factor: 5.923