Literature DB >> 34252961

doubletD: detecting doublets in single-cell DNA sequencing data.

Leah L Weber¹, Palash Sashittal^1,2, Mohammed El-Kebir¹.

Abstract

MOTIVATION: While single-cell DNA sequencing (scDNA-seq) has enabled the study of intratumor heterogeneity at an unprecedented resolution, current technologies are error-prone and often result in doublets where two or more cells are mistaken for a single cell. Not only do doublets confound downstream analyses, but the increase in doublet rate is also a major bottleneck preventing higher throughput with current single-cell technologies. Although doublet detection and removal are standard practice in scRNA-seq data analysis, options for scDNA-seq data are limited. Current methods attempt to detect doublets while also performing complex downstream analyses tasks, leading to decreased efficiency and/or performance.
RESULTS: We present doubletD, the first standalone method for detecting doublets in scDNA-seq data. Underlying our method is a simple maximum likelihood approach with a closed-form solution. We demonstrate the performance of doubletD on simulated data as well as real datasets, outperforming current methods for downstream analysis of scDNA-seq data that jointly infer doublets as well as standalone approaches for doublet detection in scRNA-seq data. Incorporating doubletD in scDNA-seq analysis pipelines will reduce complexity and lead to more accurate results.
AVAILABILITY AND IMPLEMENTATION: https://github.com/elkebir-group/doubletD. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: CellLine Chemical Disease Gene Species

Year: 2021 PMID： 34252961 PMCID： PMC8275324 DOI： 10.1093/bioinformatics/btab266

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

The increased use of single-cell sequencing for cancer research is providing a wealth of new insights regarding intratumor heterogeneity, metastasis and the landscape of the tumor microenvironment (Gawad ; Lim ; Miles ; Morita ). In particular, the ongoing improvement in single-cell DNA sequencing (scDNA-seq) assays is rapidly advancing methods for reconstructing the evolutionary history of a tumor (El-Kebir, 2018; Jahn ; Ross and Markowetz, 2016; Roth ; Satas ; Zafar ). While scDNA-seq is more labor intensive and error-prone than traditional bulk DNA sequencing (Pellegrino ), scDNA-seq permits the observation of mutation co-occurrence patterns within a single cell, yielding both higher fidelity tumor phylogeny reconstructions and more accurate identification of a set of distinct tumor clones or genotypes. The smaller amount of DNA material within a cell compared to RNA poses additional sequencing challenges than those faced in single-cell RNA-sequencing (scRNA-seq) (De Bourcy ). Medium to high coverage scDNA-seq technology, suitable for detecting single-nucleotide variants, suffers from elevated rates of technical errors due to whole-genome amplification that may impact downstream analyses, including allelic dropout (ADO), copying mistakes in the amplification reaction, unbalanced amplification and doublets. Specifically, when ADO occurs, one or more of the alleles may fail to be amplified during the early stages of the process and thus the allele is said to ‘drop out’ prior to sequencing. While technological advances have decreased the frequency of these errors, one remaining technical challenge is when multiple cells, or multiplets, are captured within a droplet and linked to a single barcode making all subsequent reads appearing as if they originated from one cell. To mitigate this effect, practitioners utilize a Poisson distribution to estimate the probability that a droplet contains a specified number of cells. The rate parameter of the Poisson distribution is then determined by a function of the cell solution concentration and droplet volume to obtain the desired probability of multiplets (Liu ). This results in the majority of droplets containing zero cells and multiplets with more than two cells are rare. However, doublets, which are droplets containing two cells, occur frequently and are therefore the focus of this work (Kuipers ; Navin and Chen, 2016; Zafar ). Adapting terminology from the scRNA-seq literature (Wolock ), we introduce three categories for doublets in scDNA-seq: (i) selflet, (ii) nested and (iii) neotypic (Fig. 1a). Selflets are comprised of cells with identical genotypes. Nested doublets occur when the set of mutations in one cell is a proper subset of the mutations in the other cell. A neotypic doublet is a doublet that is not nested or a selflet and implies the existence of a novel genotype not present in the sample. Neotypic doublets thus distort the signal of mutation co-occurrence patterns and makes it challenging to distinguish the presence of rare clones, that may be resistant to certain treatments, from a neotypic doublet (Pellegrino ). Although nested doublets and selflets will not impact the analysis of mutation co-occurrence or mutual exclusivity patterns, they may impact the estimation of clonal abundances, which are used to model both the evolutionary trajectory and the fitness landscape of a tumor (Miles ; Salehi ).

Fig. 1.

doubletD calls doublets in medium to high coverage scDNA-seq data. (a) The first step of most single-cell sequencing technologies involves cell capture where the goal is to encapsulate single cells into droplets, known as singlets. However, errors in this process (details in Section 1) can lead to three kind of doublets—neotypic doublets, nested doublets and selflets. (b) The cells in each isolated droplet i undergo whole-genome amplification and sequencing independently. These processes introduce errors such as ADOs and imbalance in amplification. (c) The resulting aligned reads are used for variant calling yielding alternate and total read counts at each locus of interest j. (d) doubletD uses the observed variant allele frequencies as the key signal, while accounting for sequencing and amplification errors to detect doublets in the sample. The symbol denotes element-wise division While there are downstream analysis methods, such as genotype and/or phylogeny inference methods, that account for the presence of doublets, to the best of our knowledge, there exists no standalone method for doublet detection in scDNA-seq data. There are a number of drawbacks to methods that jointly infer the doublets during any downstream analysis. First, methods like SCITE (Kuipers ), SCG (Roth ) and SiCloneFit (Zafar ) utilize Bayesian inference in the form of Markov chain Monte Carlo (MCMC) or variational inference, which scale poorly with the inclusion of doublets and size of the input (Kuipers ; Roth ; Zafar ). Second, methods, such as ScisTree (Wu, 2020), are able to identify doublets only under the infinite sites model of evolution. Third, most methods require a binarized or discretized experiment by loci matrix input as opposed to positional variant and reference allele read counts. This results in the loss of useful information for doublet identification. Lastly, as a result of the discrete input and/or utilizing the infinite sites assumption, methods that do identify doublets are at best only able to identify neotypic doublets. In contrast, there exist a number of standalone methods for detecting doublets in single-cell RNA-sequencing data (DePasquale ; McGinnis ; Wolock ). See Xi and Li (2020) for an excellent overview and benchmarking of scRNA-seq doublet detection methods. Doublets in single-cell RNA-sequencing (scRNA-seq) result in the observation of neotypic gene expression profiles, which impacts cell clustering and the identification of cell-state trajectories (Xi and Li, 2020). In general, these methods follow a four-step process. First, simulated doublets are created by mixing observed gene expression profiles. Second, the observed and simulated data are embedded into a latent space using dimensionality reduction. Third, machine learning methods are used to estimate the probability that a droplet is a doublet. Finally, a threshold scheme is enacted based on knowledge of the experimental doublet rate to classify experiments as either a singlet or doublet. The main variation within these methods is the choice of embedding/dimension reduction and classifier. Additionally, these methods are designed to capture neotypic doublets and struggle to identify embedded doublets, which are often located within clusters of singlets in the embedded space. While it is possible to directly apply scRNA-seq doublet detection methods on DNA variant read counts, such methods do not properly account for the distinct error profile of scDNA-seq data. As a first step in addressing the need for a fast, standalone method for scDNA-seq doublet detection, we introduce doubletD, which performs doublet detection in medium to high coverage scDNA-seq data. Critically, doubletD does not make any assumptions about the model of evolution, the number of distinct clones or assume a threshold on the minimum clonal abundance in the sample. doubletD operates directly on variant and reference allele counts without the need to discretize the input, thus retaining a critical signal for doublet detection in the form of the variant allele frequency (VAF) (Fig. 1c). Specifically, underlying doubletD is the observation that doublets in scDNA-seq data have a characteristic VAF spectrum due to increased number of copies and/or ADO (Fig. 1d). Others have noted the presence of some of these characteristics in a post hoc analysis of either single-nucleotide variant (Luquette ) or copy-number aberration (CNA) calling (Zaccaria and Raphael, 2020). doubletD considers each droplet independently but borrows strength from the entire dataset while using a maximum likelihood approach in order to rapidly classify an experiment as either a doublet or singlet prior to downstream analyses. We demonstrate on both simulated and real datasets that these design choices allow doubletD to be utilized in conjunction with any downstream analysis of choice and therefore obviates the need for more complex downstream methods to individually account for the presence of doublets within their own models.

2 Materials and methods

2.1 Generative model

Similarly to scRNA-seq, there are two main types of high-throughput cell capture strategies in scDNA-seq: microfluidics and well-based protocols, which, respectively, distribute a cell suspension into either droplets or wells (Chen ; Hwang ). Here, we use the term ‘droplet’ independent of the used technology. Consider a scDNA-seq experiment with n droplets and m mutation loci that were identified after read alignment and variant calling. Each mutation locus has two alleles: a reference allele and a variant allele. Thus, we are given total read counts and variant counts, which are independent across droplets and loci. Read counts and of mutation locus j in droplet i are affected by (i) whether droplet i is a doublet (Section 2.1.1), (ii) the genotype(s) at locus j in the droplet (Section 2.1.2), and errors during sequencing including (iii) ADO (Section 2.1.3) and (iv) amplification bias and sequencing errors (Section 2.1.4). We make these relationships explicit in a generative model for and (Fig. 2).

Fig. 2.

Plate diagram of doubletD ’s graphical model. Observed total and variant read counts of m loci in n droplets are affected by doublet status , ADO and additional errors during sequencing

2.1.1 Doublet model

In the following, we will define random variables , where z indicates whether droplet i is a doublet (i.e. z = 1) or a singlet (i.e. z = 0). During the capture step, cells are released into a nozzle with a constant rate r and there is a fixed time-interval t in which a droplet is formed. The number of cells in a droplet is given by the number of cells that enter the nozzle in the time-interval during which the droplet is formed. Therefore, the prior on the doublet probability is a Poisson distribution with mean . Moreover, only nonempty droplets will yield sequence reads. This combined with the fact that doublets are composed of two cells, we have that z = 1, i.e. the event of droplet i being a doublet, equals where is the probability of occurrences (here cells) under a Poisson distribution with mean λ. In practice rt is very small (i.e. ), and thus the mass of the Poisson distribution is concentrated around two outcomes . Therefore, z can be approximately modeled by a Bernoulli distribution with probability of success so that Considering independence between distinct droplets, we get

2.1.2 Genotype model

We make the simplifying assumption that each mutation locus has copy number 2 in a single cell—we show robustness of violations to this assumption in Section 3.1. Thus the genotype of a locus j in a single cell can be in one of three states: (i) wild-type (wt) where both copies have the reference allele, (ii) heterozygous (het) with one variant and one reference copy and (iii) homozygous (hom) where both copies have the variant allele. Let and be the mutation probabilities at locus j of the three types, respectively, such that . Let indicate the VAF at locus j in droplet i. In case i is a singlet, we have that where for any locus j. On the other hand, if i is a doublet, we have that where for any locus j. For a droplet i comprising of a single cell (z = 0), the probability equals Following current single-cell literature (Gerstung ; Zafar ), we assume that a doublet contains two cells with independent genotypes. Therefore, we may define using probabilities as where gives all pairs (g, h) of VAFs in that result in the doublet VAF f. For example, a doublet VAF f = 1/2 results from two cells with pairs (g, h) of VAFs in the set .

2.1.3 ADO model

We follow the work in Posada (2020) and Zafar to model the shift in VAF due to ADOs. In this model, ADO is introduced by deciding for each cell whether a given allele is amplified or not according to a specific probability β known as the ADO rate. Dropout of distinct alleles is assumed to be independent and the ADO rate β is assumed to be constant for all cells and all loci. Although this could be easily extended to account for site-specific ADO as considered in other work (Lähnemann ); here, we opt for a global ADO rate to reduce the number of parameters. The VAF at locus j in droplet i after the dropout event depends on the VAF and doublet indicator z (Fig. 2). Specifically, each possible pair , where when z = 0 and when z = 1, can yield varying with probabilities that depend on the number of alleles that are dropped during amplification. Using that each mutation locus has copy number 2 in a single cell and allowing any number of copies to drop out, we have where if droplet i is a singlet. Conversely, if i is a doublet, we have where . Supplementary Table S1 lists all values of for varying and given ADO rate β. Supplementary Figure S1 shows an illustrative example of ADO.

2.1.4 Read count model

Beyond ADO, there are two types of additional errors that affect read counts and lead to an observed VAF that differs from the latent VAF after ADO: (i) copy errors, which occur early during PCR and lead to a propagation of incorrect nucleotides, and (ii) allelic imbalance, where amplification is biased toward one of the alleles (De Bourcy ). We model the resulting overdispersion with a beta-binomial as is standard in the field (Gerstung ; Lähnemann ; Zafar ). We use an uninformative prior on total read counts yielding While copy errors and uneven amplification errors happen simultaneously during the amplification stage, here, following Lähnemann , we employ a simpler model that assumes that the copy errors precede the allelic imbalance during amplification. We capture copy errors using a specified false positive rate , which is the probability of generating an alternate allele in the copy when the template has the reference allele, and false negative rate , which is the probability of generating a reference allele in the copy when the template has the alternate allele. Specifically, the probability of producing a copy with the alternate allele at locus j in experiment i is given by The number of variant reads resulting after amplification in the presence of allelic imbalance is modeled by the following beta-binomial distribution where s is the precision parameter that quantifies allelic imbalance error. A low precision s signifies high unevenness in amplification.

2.2 Posterior probability

To determine which droplets are doublets, we are interested in the posterior probability of for the given single-cell sequencing data , which is defined as In line with current methods (Zafar , 2019), we use independence of read counts across mutation loci and droplets and obtain We now express in terms of (described in Section 2.1.2), (described in Section 2.1.3) and (described in Section 2.1.4). Marginalizing over and yields where

2.3 doubletD

Our goal is to find such that the likelihood function [Equation (2)] is maximized. Substituting the doublet prior from Equation (1) in Equation (2) and taking , we get where K is the constant of proportionality. Since z is an indicator variable (i.e. ), we linearize the above equation in terms of using where and where the last equality uses doublet prior model [Equation (1)]. Note that, since the read counts are observed, the matrix is constant. Ignoring the constant of proportionality K, which is independent of , and using linearization of and in Equation (3), we get the following linear objective function: where is a constant defined as follows: Since is linear, we have the following closed-form solution maximizing

2.3.1 Implementation details

Our resulting method, doubletD, identifies given total and variant read counts with maximum posterior probability . To do so, doubletD requires input mutation probabilities and at each locus j used in the genotype model (Section 2.1.2), and the precision parameter s used in the read count model (Section 2.1.4). Supplementary Appendix A.1 describes a data-driven approach to estimate these parameters. Moreover, the doublet prior probability δ can either be taken as input or estimated by maximizing the posterior probability. doubletD is implemented in Python 3, is open source (BSD-3-Clause license), and is available at https://github.com/elkebir-group/doubletD.

3 Results

We evaluated the performance of doubletD via in-silico experiments with known ground-truth doublets (Section 3.1) as well as two real datasets: (i) a two cell line mixture (Section 3.2) and (ii) six patients with acute lymphoblastic leukemia (Gawad ) (Section 3.3).

3.1 In-silico experiments

We aim to answer the following questions: (i) Is doubletD agnostic to the choice of scDNA-seq assay and experimental design? (ii) How robust is doubletD to the presence of CNAs? (iii) Will the removal of doublets improve downstream analyses? To this end, we simulated scDNA-seq data of 10 genotypes under an evolutionary model that incorporates CNAs and SNVs, varying the number of SNVs , the doublet probability , the mean sequencing coverage and ADO probability . Each combination of simulation parameters was replicated with five different random number generator seeds, amounting to a total of 405 experiments. In each experiment, we simulated 500 in-silico droplet. We benchmarked our method against SCG (Roth ), a genotyping method for scDNA-seq data whose model optionally incorporates doublet detection, which we refer to as SCG:doublet, and Scrublet (Wolock ), a standalone doublet detection method designed for scRNA-seq data. We were not able to benchmark against SiCloneFit (Zafar ) and SCITE (Kuipers ), which are tree inference methods that also incorporate doublets, due to their prohibitive runtimes when run in doublet mode. Supplementary Appendix B.1 further details the simulation design, evolutionary model and method arguments. In particular, for SCG, we performed 25 restarts unless specified otherwise, using the restart with the maximum evidence lower bound (ELBO).

3.1.1 Assay and design agnosticism

We focus on simulations with a mean coverage of and simulated doublet probability of . We refer to Supplementary Appendix B.1 for other simulation regimes. While all three methods show increasing F1 scores (the harmonic mean between precision and recall) with increasing number m of mutations, doubletD achieves the highest F1 score (median: 0.88) compared to SCG:doublet (median: 0.76) and Scrublet (median: 0.37) (Fig. 3a). Specifically, we find that Scrublet has the worst performance in terms of both recall (median: 0.35) and precision (median: 0.38), demonstrating that doublet detection methods developed for scRNA-seq data cannot be directly applied to scDNA-seq data. While both doubletD and SCG:doublet have equivalently high precision (SCG:doublet median: 0.99 versus doubletD median: 0.98), doubletD has superior recall (median: 0.78) among all methods (median recall of 0.67 for SCG:doublet and 0.35 for Scrublet). Strikingly, SCG:doublet performs poorly in the regime of a small number m = 10 of mutations, with a median recall and precision of 0.21 and 1.00, compared to 0.70 and 0.87 for doubletD, respectively. Such small number of mutations do occur in practice—e.g. the ALL data analyzed in Section 3.3.

Fig. 3.

Simulations show that doubletD has high recall and precision in doublet detection, outperforming SCG and Scrublet across various experimental regimes and improving performance in downstream genotyping. (a) F1 score, precision and recall of doublet detection for the three competing methods (doubletD, SCG:doublet and Scrublet) in simulations with varying ADO rate β and number of mutations m in the absence of CNAs (γ = 0). (b) Recall of the three kind of doublets, i.e. neotypic, nested and selflet. (c) F1 score, precision and recall by method in the presence of CNAs () and varying ADO rate β. All results are for simulations with doublet probability , mean read depth and precision parameter s = 15 Zooming in on doublet type in Figure 3b, we find that all methods have the highest recall for neotypic doublets (median: 1.00 for doubletD, 1.00 for SCG:doublet and 0.50 for Scrublet), and that the recall increases for both nested and neotypic doublets with increasing number of mutations and increasing ADO. Notably, doubletD has the highest recall for nested doublets (median: 0.85) compared to SCG:doublet (median: 0.57) and Scrublet (median: 0.15). As expected, doubletD and SCG:doublet are unable to detect selflets for ADO rate 0.05 while Scrublet does detect a small proportion of selflets (median: 0.05). However, when ADO rate is 0.25, doubletD has significantly higher recall (median: 0.6) as compared with SCG:doublet (median: 0) and Scrublet (median: 0.2). Note that SCG:doublet is unable to detect selflets due to VAF discretization. Further, both SCG:doublet (IQR: 0.34–0.80) and Scrublet (IQR: 0.13–0.50) show large variance in recall rates as opposed to doubletD (IQR: 0.73–0.92). Additionally, we find that our method maintains its good performance in simulations when varying coverage and doublet probabilities (Supplementary Fig. S2). The lower bound of coverage for the in-silico experiments was . Even at such a low coverage, doubletD maintains its good performance (median precision: 0.83 and median recall 0.78, see Supplementary Fig. S2a). It is also important to note that doubletD’s improved performance does not come at the expense of running time (Supplementary Fig. S4a, median: 14.9 s versus s for SCG:doublet and 4.1 s for Scrublet). Finally, doubletD is robust to the choice of user-specified parameters such as the precision s (Supplementary Appendix B.1.4, Supplementary Figs S5–S8). In summary, we find that doubletD is robust to many variations in experimental assays and design, outperforming SCG:doublet and Scrublet.

3.1.2 Robustness with respect to CNAs

In order to evaluate the robustness of doubletD to the presence of CNAs, we generated simulations with varying probability of CNAs , where γ = 0 represents simulations with no CNAs. More specifically, for each locus that undergoes a CNA (with probability γ), we introduced a loss with probability and a gain otherwise. We ran SCG:doublet with five restarts due to increased runtimes compared to the copy-neutral simulations. Although doubletD does not explicitly account for CNAs, Fig. 3c shows that doubletD is robust to varying CNA probability γ, outperforming SCG:doublet and Scrublet in most regimes. Specifically, doubletD yields the highest recall (median: 0.79) with good precision (median: 0.98) resulting in the highest F1 score (median: 0.87) compared to SCG:doublet (median: 0.80) and Scrublet (0.36). While SCG:doublet has the same precision as doubletD (median: 0.98), this comes at the cost of lower recall (median: 0.73) compared to doubletD (median: 0.79). The robustness of doubletD can be explained by the observation that losses (deletions) introduced by CNAs behave similarly to ADOs, which is a key signal used by doubletD to detect doublets. We demonstrate the vulnerability of doubletD to copy number gains on simulations with highest possible CNA probability γ = 1 and lowest possible loss probability (Supplementary Fig. S3). Note that this kind of extreme presentation of CNAs is not observed in practice and that copy number losses including loss of heterozygosity events are common in cancer (El-Kebir, 2018; McPherson ; Satas ). In summary, we find that doubletD is robust to the presence of CNAs and outperforms both SCG:doublet and Scrublet in doublet detection.

3.1.3 Improving downstream genotype calling

SCG is a genotyping method for scDNA-seq data of tumors that includes doublet detection. It has two modes: in singlet mode (SCG:singlet) all droplets are considered singlets, whereas in doublet mode (SCG:doublet) genotypes and doublets in the sample are jointly inferred. Here, we assess whether the sequential use of doubletD followed by SCG:singlet (doubletD + SCG:singlet) performs better than SCG:singlet and SCG:doublet. In each of these settings, SCG is run with 25 restarts. Recall that each of our simulated instances contain 10 genotypes. To assess the performance of the three methods, we compute recall, precision and F1 score with respect to these ground-truth genotypes, considering a genotype as correctly inferred (i.e. a true positive) if it precisely matches a ground-truth genotype. Thus, if a method infers the exact set of 10 ground-truth genotypes, its recall, precision and F1 score will be 1. We find that doubletD + SCG:singlet has the highest F1 score (median: 0.95) compared to SCG:singlet (median: 0.73) and SCG:doublet (median 0.89) across all experimental regimes (Fig. 4a). SCG:singlet has good genotype recall (median: 0.9) but reduced precision (median: 0.64) since it misidentifies doublets as cells with distinct genotypes. SCG:doublet, on the other hand, has better precision (median: 1.0) but filters out rare genotypes misidentified as doublets resulting in reduced recall (median: 0.80). doubletD + SCG:singlet yields the highest recall (median: 0.90) and precision (median: 1.0). In general, SCG:singlet calls more genotypes (median: 14) while SCG:doublet calls fewer genotypes (median: 8.5) compared to the ground truth of 10 genotypes (Supplementary Fig. S9). On the other hand, doubletD + SCG:singlet is closer to ground truth with a median of 9.5 distinct genotypes. Furthermore, Fig. 4b shows that doubletD + SCG:singlet takes orders of magnitude less time compared to SCG:doublet. While SCG:singlet takes the least time to run, it also yields the lowest F1 score (Fig. 4a).

Fig. 4.

Simulations show that removal of doublets using doubletD improves downstream genotype calling with reduced runtime. (a) F1 score, precision and recall of genotypes for doubletD + SCG:singlet, SCG:doublet and SCG:singlet for varying number of mutation m and ADO rate β and without CNAs (γ = 0). (b) Running time for genotype calling using doubletD + SCG:singlet, SCG:doublet and SCG:singlet for simulations with varying number of mutations m without CNAs (γ = 0). All results are for simulations with doublet probability , mean read depth and precision parameter s = 15 In summary, we find that the use of doubletD improves genotype calling of SCG while incurring runtimes comparable to SCG in singlet mode. This suggests that doublet removal using doubletD is a useful preprocessing step for downstream analyses of scDNA-seq data of tumors.

3.2 Mixture of two cell lines

We validated doubletD on a dataset of droplets comprised of a 50–50% mix of KG-1 and Raji cell lines (with m = 26 loci) captured by Mission Bio’s Tapestri platform and sequenced by Illumina NextSeq (https://portal.missionbio.com/datasets/KG-1-Raji-50-50-Myeloid). Supplementary Appendix B.2 details the data preparation, including the exclusion of 23 cells that had a genotype distinct from the two cell lines. KG-1 had 12 heterozygous (het), 7 wt and 7 homozygous loci, while Raji had 11 heterozygous, 7 wt and 8 homozygous loci (Supplementary Fig. S10). The mean sequencing coverage c was . Following the procedure outlined in Supplementary Appendix A.1, we fit beta-binomial precision s = 10.5, and locus-specific mutation probabilities to the observed variant and total read counts . We used the experimental ADO rate ( previously estimated by Morita on a large patient cohort using Mission Bio’s Tapestri platform. There are two unique characteristics of this dataset that permit identification of neotypic doublets for orthogonal validation: (i) the droplets are easily clustered into two clones by the cell line of origin (Supplementary Fig. S10) and (ii) the droplets are comprised of distinct cell lines with distinct evolutionary histories. These characteristics are uncommon in regular datasets where the number of clones and associated genotypes is unknown a priori and droplets originate from a single tumor whose clones have a shared evolutionary history. As such, we conclude that doublets will be either neotypic (one cell from each cell line), or selflets (two cells from one cell line). Using the property that the two cell lines have independent origins and relaxing Mission Bio’s standard filtering criteria, we identified an additional set of five validation loci with distinct wt/homozygous states among the two cell lines, i.e. each validation locus has state wt (hom) in one cell line and hom (wt) in the other (Fig. 5a). Recall that a singlet i will have an observed VAF of approximately 1 if locus j is homozygous and VAF 0 if locus j is wt. As such, any droplets with observed VAF not close to either 0 or 1 (Fig. 5a) indicate that the droplet may be a neotypic doublet comprised of a cell from each cell line. We therefore assign a neotypic doublet confidence score (NCS) to each droplet, counting the number of validation loci with VAF between 0.15 and 0.85. This approach yielded droplets with , 33 droplets with and 42 droplets with . Note that the NCS is specifically designed to express confidence that a doublet is neotypic but does not capture selflets. Supplementary Figure S10 shows a comparison of the observed VAF of droplets categorized by cell line droplets with a .

Fig. 5.

doubletD successfully recalls all orthogonally validated high confidence neotypic doublets and identifies putative selflets in a two cell line mixture dataset. (a) The VAF for each droplet at each of the five validation loci. Droplets are assigned a NCS, which is the number of validation loci whose VAF was in the range (dotted lines). (b) The resulting proportion (total) of droplet calls by method (doubletD, SCG:doublet and Scrublet) by prediction (singlet, doublet) and NCS. (c) The aggregated observed VAF distribution by doubletD prediction and cell line for droplets with . The number of droplets in the aggregate are shown in the parentheses We ran doubletD, SCG:doublet (with five restarts) and Scrublet. Since we did not know the true doublet probability δ, we used the maximum likelihood criterion to establish the estimate the doublet probability for doubletD as (Supplementary Fig. S11a). However, we provided SCG and Scrublet with the doublet probability as estimated by Mission Bio in similar cell line experiments (Mission Bio, 2019). For each method and NCS, we calculated the proportion of predicted singlets and doublets (Fig. 5b). doubletD identified the most droplets as doublets (54), followed by SCG:doublet (42) and Scrublet (30). doubletD predicted 100% of doublets with whereas SCG:doublet identifies 95.2% of these droplets with similarly high NCS. Scrublet is the worst performing, identifying only 61.9% of such droplets (Fig. 5b). In terms of running time, SCG:doublet took s, doubletD took 24.1 s and Scrublet took 2.4 s. All three methods designated the same droplet at as a doublet. This suggests that for the remaining 32 droplets at the observed VAF in at one of these five validation loci is likely attributable to amplification and sequencing error. The one doublet identified by all methods does appear to be neotypic as evidenced by an observed VAF of 0.39 for the validation locus on chromosome 17, which is far from the cut off criterion of 0.15 and is hard to explain by other errors. Furthermore, the VAF distribution across the 26 inference loci for this droplet has a peak at 0.25 and is strikingly different from the distribution of the other Raji droplets with NCS equal to 1 (Supplementary Fig. S11b). Lastly, doubletD identifies 11 (proportion: 0.007) putative selflets at , 3 of which are KG-1 and 8 are Raji. SCG calls 1, which was also called by doubletD, and Scrublet calls 3 such droplets with only one called by doubletD. Corroborating this, we note a visual difference in the aggregated VAF distribution across the inference 26 loci between doubletD predicted singlets and doublets with (Fig. 5c). A Venn diagram of the droplets with different NCS score that were predicted as doublets by the three competing methods is shown in Supplementary Fig. S12. In summary, doubletD is able to recall all orthogonally validated high confidence neotypic doublets (with NCS ) as well as successfully distinguish the VAF signal of neotypic doublets from sequencing-related error. In addition, we suspect that doubletD is able to recall a small number of selflets even in the presence of low ADO rates ().

3.3 Phylogeny inference of an acute lymphoblastic leukemia patient

As discussed in Section 1, while nested doublets and selflets do not yield new genotypes, neotypic doublets can be mistaken as an additional clone with a unique genotype (Navin and Chen, 2016). In the extreme case of a phylogeny with only two branches, neotypic doublets that correspond to the two leaves of this tree will include all mutations. Consequently, phylogeny inference under the infinite sites assumption will yield a linear phylogeny. Here, we investigate the impact of doublets on phylogeny inference for a patient (Patient 1) in an acute lymphoblastic leukemia cohort previously suspected to contain doublet droplets (Gawad )—we refer to Supplementary Table S2 for doubletD results of the other patients. Gawad sequenced 243 droplets and identified 20 mutations for Patient 1. We analyzed this patient using PhISCS-B (Malikic ), which is a phylogeny inference method that seeks to identify a tree constrained by the infinite sites assumption. Since it does not account for doublets, PhISCS-B requires doublets be removed in a preprocessing step. While SCG:doublet was unable to identify any doublets, doubletD identified 50 doublets for this patient. Supplementary Figure S13 corroborates these doublets, showing distinct VAF distributions between singlets and doublets for an orthogonal set of holdout loci. We ran PhISCS-B in single-cell data mode on the complete set of droplets (including doublets) as well as the set of droplets without doublets (details in Supplementary Appendix B.3). Figure 6 shows that doublet removal in this patient results in a branching phylogeny with a higher mean likelihood () compared to a linear phylogeny () on the complete set of droplets. Furthermore, the branching pattern observed in the inferred phylogeny after doublet removal is in agreement with several other trees published for Patient 1, with identical grouping of the mutations across the two branches (Gawad ; Kuipers ; Malikic ).

Fig. 6.

Doublets lead to incorrect phylogeny inference in acute lymphoblastic leukemia patient 1. (a) PhISCS-B returns a linear phylogeny with mean log likelihood of if the 50 doublets detected by doubletD are retained. (b) PhISCS-B returns a branching phylogeny with higher mean likelihood of Thus, phylogeny inference is an additional example of a downstream analysis where the inclusion of doublets may yield incorrect conclusions.

4 Discussion

In this work, we introduced doubletD, the first standalone method for detecting doublets in scDNA-seq data with medium to high coverage suitable for single-nucleotide variants. Our method operates directly on variant and total read counts of mutation loci. Underlying our method is the observation that doublets in scDNA-seq data have a characteristic VAF distribution. An additional signal that we exploit is the shift in VAFs due to ADO. This unique approach enables doubletD to capitalize on a major downside of single-cell sequencing in order to identify selflets and nested doublets, that are notoriously hard to detect by current methods. doubletD utilizes a probabilistic model that specifically accounts for allelic imbalance and dropout during whole-genome amplification in scDNA-seq as well as sequencing errors. We introduced a closed-form solution for the inference problem. We demonstrated that our method outperforms current methods for downstream analysis of scDNA-seq data that jointly infer doublets and genotypes (Roth ) as well as standalone approaches for doublet detection in scRNA-seq data (Wolock ). Moreover, we showed that removing doublets using doubletD improves the accuracy and efficiency of downstream analyses such as genotype calling and phylogeny inference. There are several opportunities for future work. First, while this paper focused on cancer, doubletD can be applied to normal samples as well using heterozygous germline SNPs. Moreover, the same characteristic signal used by our method to detect doublets can be used to detect cells that have undergone whole-genome duplication or are in S-phase with actively replicating DNA. Second, our approach can be extended to support low () to ultra-low () coverage scDNA-seq samples, suitable for CNAs, by pooling heterozygous germline SNPs located within haplotype blocks. Third, our current formulation assumes that normal cells are diploid. As noted in our simulations, performance slightly decreased in the presence of CNAs. We plan to extend our probabilistic model to account for copy number. Finally, we envision that doubletD will improve downstream analysis of current and future methods, making doublet detection and removal a standard practice in scDNA-seq analysis.

Funding

M.E-K. was supported by the National Science Foundation under award numbers CCF 1850502 and CCF 2046488. Conflict of Interest: none declared.

Data availability statement

The data underlying this article are available at https://github.com/elkebir-group/doubletD_data. Click here for additional data file.

31 in total

1. Monovar: single-nucleotide variant detection in single cells.

Authors: Hamim Zafar; Yong Wang; Luay Nakhleh; Nicholas Navin; Ken Chen
Journal: Nat Methods Date: 2016-04-18 Impact factor: 28.547

2. Genotyping tumor clones from single-cell data.

Authors: Nicholas E Navin; Ken Chen
Journal: Nat Methods Date: 2016-06-29 Impact factor: 28.547

3. DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors.

Authors: Christopher S McGinnis; Lyndsay M Murrow; Zev J Gartner
Journal: Cell Syst Date: 2019-04-03 Impact factor: 10.304

4. SCARLET: Single-cell tumor phylogeny inference with copy-number constrained mutation losses.

Authors: Gryte Satas; Simone Zaccaria; Geoffrey Mon; Benjamin J Raphael
Journal: Cell Syst Date: 2020-04-22 Impact factor: 10.304

5. Clonal genotype and population structure inference from single-cell tumor sequencing.

Authors: Andrew Roth; Andrew McPherson; Emma Laks; Justina Biele; Damian Yap; Adrian Wan; Maia A Smith; Cydney B Nielsen; Jessica N McAlpine; Samuel Aparicio; Alexandre Bouchard-Côté; Sohrab P Shah
Journal: Nat Methods Date: 2016-05-16 Impact factor: 28.547

6. A quantitative comparison of single-cell whole genome amplification methods.

Authors: Charles F A de Bourcy; Iwijn De Vlaminck; Jad N Kanbar; Jianbin Wang; Charles Gawad; Stephen R Quake
Journal: PLoS One Date: 2014-08-19 Impact factor: 3.240

7. CellCoal: Coalescent Simulation of Single-Cell Sequencing Samples.

Authors: David Posada
Journal: Mol Biol Evol Date: 2020-05-01 Impact factor: 16.240

8. SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data.

Authors: Hamim Zafar; Nicholas Navin; Ken Chen; Luay Nakhleh
Journal: Genome Res Date: 2019-10-18 Impact factor: 9.043

9. Improving Single-Cell Encapsulation Efficiency and Reliability through Neutral Buoyancy of Suspension.

Authors: Hangrui Liu; Ming Li; Yan Wang; Jim Piper; Lianmei Jiang
Journal: Micromachines (Basel) Date: 2020-01-15 Impact factor: 2.891

10. Clonal evolution of acute myeloid leukemia revealed by high-throughput single-cell genomics.

Authors: Kiyomi Morita; Feng Wang; Katharina Jahn; Tianyuan Hu; Tomoyuki Tanaka; Yuya Sasaki; Jack Kuipers; Sanam Loghavi; Sa A Wang; Yuanqing Yan; Ken Furudate; Jairo Matthews; Latasha Little; Curtis Gumbs; Jianhua Zhang; Xingzhi Song; Erika Thompson; Keyur P Patel; Carlos E Bueso-Ramos; Courtney D DiNardo; Farhad Ravandi; Elias Jabbour; Michael Andreeff; Jorge Cortes; Kapil Bhalla; Guillermo Garcia-Manero; Hagop Kantarjian; Marina Konopleva; Daisuke Nakada; Nicholas Navin; Niko Beerenwinkel; P Andrew Futreal; Koichi Takahashi
Journal: Nat Commun Date: 2020-10-21 Impact factor: 17.694

1 in total

1. CellPhy: accurate and fast probabilistic inference of single-cell phylogenies from scDNA-seq data.

Authors: Alexey Kozlov; Joao M Alves; Alexandros Stamatakis; David Posada
Journal: Genome Biol Date: 2022-01-26 Impact factor: 13.583

1 in total