Literature DB >> 35758771

Phylovar: toward scalable phylogeny-aware inference of single-nucleotide variations from single-cell DNA sequencing data.

Mohammadamin Edrisi¹, Monica V Valecha², Sunkara B V Chowdary³, Sergio Robledo⁴, Huw A Ogilvie¹, David Posada^2,5,6, Hamim Zafar^3,7,8, Luay Nakhleh¹.

Abstract

MOTIVATION: Single-nucleotide variants (SNVs) are the most common variations in the human genome. Recently developed methods for SNV detection from single-cell DNA sequencing data, such as SCIΦ and scVILP, leverage the evolutionary history of the cells to overcome the technical errors associated with single-cell sequencing protocols. Despite being accurate, these methods are not scalable to the extensive genomic breadth of single-cell whole-genome (scWGS) and whole-exome sequencing (scWES) data.
RESULTS: Here, we report on a new scalable method, Phylovar, which extends the phylogeny-guided variant calling approach to sequencing datasets containing millions of loci. Through benchmarking on simulated datasets under different settings, we show that, Phylovar outperforms SCIΦ in terms of running time while being more accurate than Monovar (which is not phylogeny-aware) in terms of SNV detection. Furthermore, we applied Phylovar to two real biological datasets: an scWES triple-negative breast cancer data consisting of 32 cells and 3375 loci as well as an scWGS data of neuron cells from a normal human brain containing 16 cells and approximately 2.5 million loci. For the cancer data, Phylovar detected somatic SNVs with high or moderate functional impact that were also supported by bulk sequencing dataset and for the neuron dataset, Phylovar identified 5745 SNVs with non-synonymous effects some of which were associated with neurodegenerative diseases.
AVAILABILITY AND IMPLEMENTATION: Phylovar is implemented in Python and is publicly available at https://github.com/NakhlehLab/Phylovar.

Entities: Chemical

Mesh：

Substances：
Nucleotides

Year: 2022 PMID： 35758771 PMCID： PMC9235480 DOI： 10.1093/bioinformatics/btac254

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.931

1 Introduction

With the advent of the first single-cell sequencing (SCS) techniques (Navin ; Tang ), the fields of single-cell genomics, transcriptomics, proteomics and epigenetics have witnessed remarkable growth over the last decade. Single-cell sequencing technologies have impacted our understanding in different fields of biology including developmental biology, immunology, microbiology and cancer biology (Kashima ; Lim ; Navin, 2014; Tang ; Wang and Navin, 2015). Single-cell DNA sequencing (scDNAseq), as one of the SCS technologies, provides insights into the somatic evolutionary process by sequencing the genomic contents of a complex tissue at a single-cell resolution (Navin, 2014; Navin ). Preparing scDNAseq data requires a whole-genome amplification (WGA) process to amplify the DNA material of a single cell to suffice the amount of DNA needed for sequencing. (Kashima ; Zafar ). WGA technologies, such as multiple displacement amplification (MDA) (Dean ; Spits ) and multiple annealing and looping-based amplification cycles (MALBAC) (Zong ) can elevate the noise level in scDNAseq data. The scDNAseq technical errors include allelic dropout (ADO), false-positive (FP) errors, false-negative (FN) errors and non-uniform coverage (Navin, 2014; Zafar ). ADO refers to cases where only one of the two alleles in a heterozygous mutation is amplified, resulting in the loss of the mutated allele. FP artifacts can appear due to uneven amplification or at the early stages of the amplification when the original nucleotide is substituted randomly. The non-uniform coverage over different genomic loci may result in missing data due to zero or insufficient coverage. The scDNAseq-specific technical errors fueled the development of tools such as Monovar (Zafar ) and SCcaller (Dong ) for detecting single-nucleotide variations (SNVs) from scDNAseq data. Although Monovar and SCcaller account for uneven coverage and scDNAseq-specific errors, more recent methods, SCI (Singer ) and scVILP (Edrisi ), showed further improvement in overcoming the scDNAseq-specific technical errors by simultaneously inferring the cells’ phylogeny and SNVs. SCI uses a Markov chain Monte Carlo (MCMC) algorithm to sample the joint posterior distribution of SNVs and the phylogenetic tree of the single cells and reports the tree(s) with the best posterior probability and the corresponding genotypes. scVILP is formulated as an instance of Mixed Integer Linear Programming (MILP) and it aims to find maximum likelihood estimation (MLE) of the observed read counts given the underlying genotype matrix. Here, the MILP solver is restricted to proposing only the genotype matrices that satisfy three-gamete condition in order to maximize the likelihood function (see Estabrook ; Fernández-Baca, 2001; Gusfield, 1991, 1997; Meacham, 1983; Semple and Steel, 2003 for more details on work related to inference under the three-gamete condition). Although ‘regularizing’ the mutation detection by using a tree as a guide is a promising direction (Edrisi ; Kuipers ; Markowska ; Singer ), applying SCI and scVILP to datasets with large number of loci such as in Evrony and Wang is challenged by either very long running time or large memory consumption of the methods—the major issues in SCI and scVILP, respectively. Indeed, scVILP runs out of memory on all of the datasets considered in our study here, except for the smallest ones, which is why we do not report on the performance of scVILP. To address this challenge, we developed Phylovar, a likelihood-based method for phylogeny-aware inference of SNVs from scDNAseq datasets consisting of a large number of loci. To simplify likelihood calculations for large-scale data, we assume that mutations occur following an infinite-sites assumption (ISA) (Deshwar ; Jahn ; Singer ). Using this model, Phylovar finds the tree topology and the placement of mutations on ancestral single cells that maximize the likelihood of the erroneous observed read counts given the genotypes. Utilizing a vectorized formulation for likelihood calculations, Phylovar benefits from the vectorized operations in matrix manipulation packages such as NumPy (Harris ) to scale up to many loci. We compared the SNV calling accuracy, memory consumption and running time of Phylovar against those of the existing methods, Monovar and SCI, through a simulation study. We found that Phylovar outperforms SCI in terms of running time with the same accuracy, while being more accurate than Monovar. Furthermore, we applied our method to two biological datasets: a triple-negative breast cancer (TNBC) dataset (Wang ) consisting of 32 single cells and 3375 candidate loci, as well as the dataset from Evrony containing 16 normal human neuron cells and 2 489 545 candidate loci. For the TNBC data, Phylovar inferred 652 SNVs with ‘high’ or ‘moderate’ functional impact, out of which 550 (84%) were also supported by bulk sequencing. For the neuron cells, Phylovar identified 5745 SNVs with non-synonymous effects some of which were related to neurodegenerative diseases. To the best of our knowledge, Phylovar is the first scDNAseq SNV caller that can utilize the underlying tree structure even when the dataset contains millions of genomic loci.

2 Materials and methods

The input to Phylovar consists of the reference and variant count matrices, denoted by and , where N and M represent the number of single cells and candidate loci, respectively. Each entry in R and V represents the number of reference and variant counts, respectively, at cell i and site j. These count matrices are obtained from an input file in mpileup format. Here, candidate loci are defined as the genomic loci with a significant number of variant reads. This significance is measured by a statistic test. Note that these loci may not necessarily contain SNVs since the variant reads might be artifacts of scDNAseq technical errors. In all experiments reported below, we used SCI’s likelihood ratio test described in Singer to identify candidate loci for the analyses. If the total read coverage at a cell and a candidate site is less than λ, the corresponding entry is treated as missing data. We used λ = 1 in practice.

2.1 Single-cell genotype error model

Our genotype model considers bi-allelic genotype with 0 and 1 representing the absence and presence of a mutation, respectively. We differentiate true genotypes from those being subject to scDNAseq errors that propagate from WGA to the sequencing library—called library genotypes. Let be the binary matrix containing the true genotypes where g represents the true genotype at cell i and locus j. Similarly, we denote the library genotype matrix by . We assume library preparation process introduces FP and FN errors into the data resulting in difference between true genotypes and their corresponding library genotypes. Let α and β denote the FP and FN error rates, respectively. Then, the probability of the library genotype given true genotype and error rates are given by the following error model adopted from SiFit (Zafar ) and SiCloneFit (Zafar ):

2.2 Single-cell read count model

For the convenience of notation, let denote the total read coverage at cell i and locus j. We assume that variant read counts follow a binomial distribution whose success probability depends on the value of the library genotype: The variables μ0 and μ1 are the success probabilities associated with reference and alternate alleles, respectively. In practice, we set μ0 to 0.001, which is at the same order of magnitude for the error rate in different Illumina sequencing platforms (Stoler and Nekrutenko, 2021). We used 0.5 for the value of μ1, which is the mean of variant read counts for a heterozygous mutation.

2.3 Tree model

Our tree model consists of two components: a binary tree topology —where V denotes the set of nodes, and E is the set of edges—and a mutation placement for each genomic locus j, . The latter is a binary vector of length containing binary elements for each leaf or internal node in V. We take to denote that a mutation occurred at node q during the evolutionary history of locus j. In our model, we assume mutations evolve following the ISA. According to this model, at most one element in is allowed to be 1. This vector requires each node to have an index from . For simplicity, we map indices to the leaves/single cells and use the same mapping for the single cells in all tree topologies.

2.4 Log-likelihood function

Assuming independence across sites/loci, the log-likelihood function of read counts given true genotypes G, error rates , underlying tree topology T and mutation placements for all loci (denoted by ) are: Note that the above likelihood is based on G rather than T and directly, as G is derived from T and . Therefore, we can drop T and from Equation (3). It can be shown that after marginalizing out ψ’s, can be simplified as follows: Here, the log-likelihood values of the missing data are assumed to be 0. The MLE solution is obtained as

2.5 Hill-climbing search algorithm

Phylovar infers the underlying phylogeny of single cells and their genotypes simultaneously in a hill-climbing fashion. At each step, the log-likelihood function is evaluated and updated by proposing one of the underlying parameters including the tree, mutation placements and error rates. We start the search by reconstructing an initial tree topology. To obtain this tree, first, we create the matrix of initial genotypes, , as follows: Here, are the initial estimates of the error rates. Given , we calculate the pairwise Hamming distances between the single cells and build an initial tree topology, , using the neighbor-joining algorithm (Saitou and Nei, 1987). Given the proposed parameters , the mutation placement with highest log-likelihood for each site j—denoted by —is determined (see below for more details) yielding the genotype matrix at first iteration, and the first log-likelihood value . At each iteration , either new error rates are estimated (see below for more details) or a new tree is proposed by performing tree rearrangement techniques including subtree pruning and re-grafting (SPR), nearest-neighbor interchange (NNI) and swapping two random leaves. The proposed parameters are accepted if the new log-likelihood value is greater than or equal to the log-likelihood in the previous iteration. In case of stochastic hill-climbing, the acceptance probability of the newly proposed log-likelihood value is: The search procedure terminates when the log-likelihood does not improve after a user-specified number of iterations or when it reaches the maximum number of iterations.

2.6 Proposing new error rates

The new error rates at iteration t are calculated using the following equations from the entries of and : Here, the number of 0 entries in that were ‘corrected’ to 1 in provides a measure of what a more realistic α would be through the hill-climbing trajectory. The same rationale applies to proposing a new value of β. In Equations (8) and (9), and indicate the new values of α and β at iteration t, respectively. Here, and denote the entries of the initial genotype matrix and the genotype matrix at iteration t–1, respectively. The term indicates the entries with non-zero read counts. The symbol represents the logical AND operator.

2.7 Finding the best mutation placement

Given a topology and a site j, each possible mutation placement on yields a unique genotype configuration at the level of single cells. We seek the mutation placement with the highest log-likelihood. Let denote the mutation placement when the kth node is mutated. As a special case, let represent the absence of mutation. Because the set of all possible mutation placements is the same for all the sites, we dropped the index j from these two notations. To summarize the effect of all possible ISA mutation placements on genotype configurations, we define whose kth row, , represents the genotype configuration corresponding to . We formally define the mapping from mutation placements to genotypes, denoted by , as follows: where and . Here, denotes the subtree rooted at node v. Note that the mapping is one-to-one, so we can use its inverse to retrieve the genotypes given a mutation placement. In addition to , we define two other matrices that store the log-likelihood values from Equation (4), one assuming all genotypes are 0, called matrix of zero-allele likelihoods, and the other assuming all genotypes are 1, called matrix of one-allele likelihoods, . Formally, we define the matrices and as follows: It can be shown that the following matrix multiplication results in a matrix whose each element is equal to the log-likelihood value of at site j: Here, is matrix of all-ones. The best mutation placement at site j, , is associated with the highest value in the jth column of : The index corresponding to the highest value is denoted by : Using , we can determine the best genotype configuration at site j which constitutes the column of using Finally, is the concatenation of best genotypes configurations at all sites:

3 Results and discussion

3.1 Simulation study

We first compared the computational efficiency and SNV calling accuracy of Phylovar, SCI and Monovar using synthetic datasets simulated under five scenarios: (i) varying the number of mutations, (ii) varying the number of cells, (iii) varying the ADO rates, (iv) copy number effect and (v) violation of ISA. We simulated the datasets using the simulator introduced in Singer . In the first scenario, we investigated how Phylovar performs compared to the other methods when increasing the number of mutations dramatically to a large extent. We simulated datasets containing 16 single cells with 1000, 104 and 105 mutations. For each mutation value, ten datasets were generated. Phylovar’s accuracy was comparable to that of SCI in terms of F1 measure, while both SCI and Phylovar were more accurate than Monovar because of accounting for evolutionary history (Fig. 1a). To measure the running time of each method, we used the CPU clock time (in seconds and log scale) during its execution. Figure 1f shows that the running time of each method increased with the number of mutations. For the largest dataset with 105 mutations, Phylovar was approximately two orders of magnitude faster than SCI.

Fig. 1.

Summary statistics of different benchmarking experiments. (a–e) F1 accuracy of the methods from simulated data with different number of mutations (a), number of cells (b), ADO rate (c), copy number rate (d) and fraction of ISA violations (e). (f, g) Runtime of the methods on simulated data with varying number of mutations (f) and varying number of cells (g). (h) Linear regression between estimated false-negative error rates (β’s) and actual ADO rates used for simulated data In the second scenario, we sought to answer how the methods’ performances depend on the number of cells. Here, we fixed the number of mutations at 105 and varied the number of cells, . For each setting, we generated ten datasets. The F1 accuracy scores of all methods improved as the number of cells increased (Fig. 1b). Again, Phylovar’s accuracy was comparable to that of SCI while it outperformed Monovar. We observed that increasing the number of single cells improved the accuracy of all methods more than increasing the number of mutations. As demonstrated in Figure 1g, similar to the first scenario, the running time of Phylovar was almost two orders of magnitude less than that of SCI. In the third scenario, the ADO rate was varied while cells and mutations were fixed at 16 and 1000. We selected ADO rates from the values , and generated ten datasets for each ADO value. Figure 1c shows that both SCI and Phylovar were more robust to high ADO rates than Monovar due to the utilization of the underlying single-cell phylogeny. Since Phylovar can estimate the false-negative error rates (denoted by β), we measured the correlation between the ADO rates used for generating the simulated datasets and the estimated β’s. As demonstrated in Figure 1h, these two values were highly correlated (the Pearson correlation coefficient was 0.991). It is worth noting that based on the linear regression line, the estimated β was almost half of the true ADO, pointing to the difference between the dropout mechanism in the simulator and our definition of β (see Section 2). Given an ADO rate μ, the simulator chooses μ fraction of the mutations. It changes of them into reference genotype, and of them into homozygous mutations; the β in our model indicates the probability of a mutation becoming reference genotype implying , which we can observe in Figure 1h. Since Phylovar assumes the read counts are originated from diploid strands, in the fourth scenario, additional wild-type alleles were introduced to the read counts to imitate the effect of copy number changes. The simulator randomly selects a fraction of mutated loci (named copy number rate), and chooses c extra copies for each loci with probability (Singer ). We increased the copy number rate from 0 to 0.5 with step size 0.125. For each value, we generated ten datasets containing 16 cells and 1000 mutations. Figure 1d shows that the SNV calling accuracy of the methods decreased as more mutated loci were subject to copy number changes. In the fifth scenario, we were interested in observing how violations of ISA affect the SNV calling accuracy of the methods. Given a fraction of mutations, the simulator randomly selects half of them to recur in different branches, and the rest of them to be lost in the same subtree. We increased the fraction of mutations subject to ISA violations from 0 to 0.15 with 0.05 step size. For each value we generated ten datasets with 16 cells and 1000 mutations. Figure 1e shows that all three methods had a stable performance as the fraction of ISA violations increased. This observation suggests that even though the phylogenies inferred by SCI and Phylovar might be inaccurate due to the presence of violations of their evolutionary model, the effect of such violations on mutation inference is negligible.

3.2 Application to real data

We applied Phylovar on two human scDNAseq datasets. The first dataset consists of single-cell whole-exome sequencing (scWES) samples from a triple-negative breast cancer (TNBC) patient (Wang ). Since the population sequencing data from bulk tumor and matched normal tissue are available for the TNBC dataset, the number of mutations shared by scWES and bulk data provides us a metric for measuring the accuracy of our approach. The TNBC dataset consists of 16 diploid cells, eight hyperdiploid/aneuploid cells and eight hypodiploid cells (Wang ). Given the control normal cells, SCI’s likelihood ratio test identifies the loci likely to contain somatic mutations. Applying this statistic test on the input mpileup file resulted in 3375 candidate loci on which we applied Phylovar. Phylovar was run with ten parallel hill-climbing chains, each for 100 000 iterations on a pool of five CPU’s, each with 48 cores (AMD EPYC 7642) on a node with 192 GB RAM. The total runtime was 91 min. Phylovar inferred an 18.21% false-negative error rate and a 1.03% false-positive error rate from TNBC data. We ran SCI and Monovar with default parameters; SCI and Monovar terminated after 10 h and 144 min, respectively. Figure 2 shows the three methods’ mutation calls on TNBC data from the overlapping sites as well as the initial genotype matrix at the first iteration of our hill-climbing search. We performed hierarchical clustering with Ward’s minimum variance method implemented in Python’s SciPy package (Virtanen ) on the genotype matrix for better visualization. We observed concordance between the calls from Phylovar and SCI while Monovar’s calls are noisy and resemble Phylovar’s initial genotypes.

Fig. 2.

Clustered heatmaps of mutation calls by different methods performed on the TNBC dataset. Here, rows and columns represent the genomic loci and the single cells, respectively. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). The initial genotypes are the initial estimates of genotypes considering no error rates and no underlying phylogeny at the starting step of Phylovar’s search algorithm (A color version of this figure appears in the online version of this article) To annotate the mutations, we applied snpEff (Cingolani ) on the SNVs detected by Phylovar. Out of 3375 candidate loci, 652 loci contained SNVs with ‘high’ or ‘moderate’ functional effects (see https://pcingola.github.io/SnpEff/se_inputoutput/ for details on the types of variants’ effects and their descriptions). Then, we ran HaplotypeCaller (GATK version 4.2.0.0) for mutation calling on the bulk tumor and normal samples. Among 652 SNVs in single cells, 550 (84%) mutations were found in bulk data (Fig. 3). Performing this analysis on the results of SCI yielded the same results as for Phylovar: 652 loci contained SNVs with high or moderate functional effect, out of which 550 mutations overlapped with the calls from bulk data. Monovar detected 18 187 high or moderate non-synonymous mutations in the single-cell data from which 10 981 mutations were found in the bulk data as well which is 60% of the single-cell calls (Figs. 4 and 5).

Fig. 3.

Fig. 4.

Clustered heatmap of the mutations detected by SCI from TNBC data and population sequencing data of tumor and matched normal tissues. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). Columns (cells) are colored according to the ploidy of the cells. Since SCI discarded the diploid cells from the final output, only hypodiploid and hyperdiploid cells are demonstrated here. The colors of the rows (genes) indicate whether the SNV was found in bulk data or not. Here, popT and popN are the tumor and normal population sequencing samples, respectively. SCI detected 652 SNVs with high or moderate functional effects in the single-cell data among which 550 mutations were found in bulk data as well, which is 84% of the single-cell calls (A color version of this figure appears in the online version of this article)

Fig. 5.

Clustered heatmap of the mutations detected by Monovar from TNBC data and population sequencing data of tumor and matched normal tissues. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). Columns (cells) are colored according to the ploidy of the cells. The colors of the rows (genes) indicate whether the SNV was found in bulk data or not. Here, popT and popN are the tumor and normal population sequencing samples, respectively. Monovar detected 18 187 high or moderate non-synonymous mutations in the single-cell data from which 10 981 mutations were found in the bulk data as well which is 60% of the single-cell calls (A color version of this figure appears in the online version of this article)

Clustered heatmap of the mutations detected by Phylovar from TNBC data and population sequencing data of tumor and matched normal tissues. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). Columns (cells) are colored according to the ploidy of the cells. The colors of the rows (genes) indicate whether the SNV was found in bulk data or not. Here, popT and popN are the tumor and normal population sequencing samples, respectively. Out of 3375 candidate loci, 652 loci contained SNVs with high or moderate functional effects in the single-cell data among which 550 mutations were found in bulk data as well, which is 84% of the single-cell calls (A color version of this figure appears in the online version of this article) Clustered heatmap of the mutations detected by SCI from TNBC data and population sequencing data of tumor and matched normal tissues. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). Columns (cells) are colored according to the ploidy of the cells. Since SCI discarded the diploid cells from the final output, only hypodiploid and hyperdiploid cells are demonstrated here. The colors of the rows (genes) indicate whether the SNV was found in bulk data or not. Here, popT and popN are the tumor and normal population sequencing samples, respectively. SCI detected 652 SNVs with high or moderate functional effects in the single-cell data among which 550 mutations were found in bulk data as well, which is 84% of the single-cell calls (A color version of this figure appears in the online version of this article) Clustered heatmap of the mutations detected by Monovar from TNBC data and population sequencing data of tumor and matched normal tissues. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). Columns (cells) are colored according to the ploidy of the cells. The colors of the rows (genes) indicate whether the SNV was found in bulk data or not. Here, popT and popN are the tumor and normal population sequencing samples, respectively. Monovar detected 18 187 high or moderate non-synonymous mutations in the single-cell data from which 10 981 mutations were found in the bulk data as well which is 60% of the single-cell calls (A color version of this figure appears in the online version of this article) The second biological data consists of 16 neuron cells on which scWGS was performed to study somatic mutations in human brain development (Evrony ). Applying SCI’s statistic test on the input mpileup identified 2 489 545 candidate loci. We ran Phylovar with five parallel hill-climbing chains, each for 50 000 iterations on 5 CPUs with 192 GB RAM. Phylovar finished the process after 17 h and 45 min. To compare our results with other methods, we ran Monovar and SCI with default parameters. Monovar processed the data in 10 h and 26 min, while SCI was still running after ten days. Phylovar’s inferred false-positive and false-negative error rates were 75.22% and 1.17%, respectively. snpEff identified 5745 non-synonymous SNVs among Phylovar’s mutation calls. Figure 6 shows hierarchical clustering on the genotypes of Phylovar, Monovar and the initial genotypes at sites with non-synonymous SNVs. We observed similarities between Monovar’s calls and the initial genotypes. By comparing the panel of Phylovar in Figure 6 with the other panels, one can see the sparse regions of mutations in panels of Monovar and the initial genotype matrix that are inferred as reference alleles by Phylovar.

Fig. 6.

Clustered heatmaps of mutation calls by different methods performed on neuron cells. Rows and columns represent the genomic loci and the single cells, respectively. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). The initial genotypes are the initial estimates of genotypes considering no error rates and no underlying phylogeny at the starting step of Phylovar’s search algorithm (A color version of this figure appears in the online version of this article) Furthermore, we investigated the genes likely related to neurodegenerative diseases by comparing our findings with the genes reported in Wei ), where somatic mutations were studied in 1461 control and diseased human brains with different neurodegenerative disorders. Among the genes inferred by Phylovar to harbor non-synonymous mutations, 12 genes were reported in Wei . We observed that the genes MUC16 and MLIP were frequently mutated in different regions; also, non-synonymous SNVs were observed within KRT33A and SEMA5B in Wei from patients with Creutzfeldt-Jakob and Alzheimer diseases, respectively. The presence of these non-synonymous mutations in both diseased and normal samples implies the high mutability of these genes even in a healthy individual.

4 Conclusions

The rapid growth of SCS technologies poses computational challenges due to the increasing number of cells and sites sequenced per genome (Lähnemann ). In this work, we focused on addressing the computational challenge associated with the breadth of genomic sites in scDNAseq data. Here, we introduced Phylovar, a scalable MLE method for phylogeny-guided inference of SNVs from single-cell DNA sequencing data suitable for scWGS and scWES data with an extensive number of loci. We introduced a novel vectorized formula for likelihood calculation, making Phylovar scalable to hundreds of thousands, even millions of loci. We assessed Phylovar’s performance against state-of-the-art variant callers SCI (Singer ) and Monovar (Zafar ), through simulated benchmarks. Phylovar outperforms SCI in terms of running time while being more accurate than Monovar in different simulation scenarios. We also applied Phylovar to two real biological datasets. For a TNBC dataset with 32 single cells and 3375 candidate loci, Phylovar identified SNVs with functional impact among which 84% were supported by bulk sequencing data. Phylovar was also more accurate than Monovar and 6.5x faster than that of SCI. For a larger dataset containing 16 normal human neuron cells and approximately 2.5 million candidate loci, Phylovar identified 5745 non-synonymous SNVs some of which were related to neurodegenerative diseases. Interestingly, Phylovar detected 75.22% false-positive, and 1.17% false-negative error rate for this dataset. The neuron cells data was particularly challenging due to large number of sites. For this data, SCI failed to converge even after ten days of running while Phylovar terminated after less than 18 h. Phylovar makes it possible to analyze datasets with large number of loci within reasonable time and memory requirements, thus adding to the growing toolbox for analyzing scDNAseq data. As a direction for future research, we will explore deviations from the simplified ISA model and investigate the feasibility of applying more general finite-sites models (FSM) (Zafar , 2019) to datasets with many loci. As scDNAseq technologies advance, the sequencing cost per cell decreases (Lim ; Tang ). Consequently, we expect more scWGS and scWES datasets to emerge in the future, requiring methods such as Phylovar that can perform scalable variant calling on datasets with millions of loci.

27 in total

1. Whole-genome multiple displacement amplification from single cells.

Authors: Claudia Spits; Cédric Le Caignec; Martine De Rycke; Lindsey Van Haute; André Van Steirteghem; Inge Liebaers; Karen Sermon
Journal: Nat Protoc Date: 2006-11-30 Impact factor: 13.491

2. Monovar: single-nucleotide variant detection in single cells.

Authors: Hamim Zafar; Yong Wang; Luay Nakhleh; Nicholas Navin; Ken Chen
Journal: Nat Methods Date: 2016-04-18 Impact factor: 28.547

Review 3. Advancing Cancer Research and Medicine with Single-Cell Genomics.

Authors: Bora Lim; Yiyun Lin; Nicholas Navin
Journal: Cancer Cell Date: 2020-04-13 Impact factor: 31.743

4. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell.

Authors: Chenghang Zong; Sijia Lu; Alec R Chapman; X Sunney Xie
Journal: Science Date: 2012-12-21 Impact factor: 47.728

5. mRNA-Seq whole-transcriptome analysis of a single cell.

Authors: Fuchou Tang; Catalin Barbacioru; Yangzhou Wang; Ellen Nordman; Clarence Lee; Nanlan Xu; Xiaohui Wang; John Bodeau; Brian B Tuch; Asim Siddiqui; Kaiqin Lao; M Azim Surani
Journal: Nat Methods Date: 2009-04-06 Impact factor: 28.547

6. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors.

Authors: Amit G Deshwar; Shankar Vembu; Christina K Yung; Gun Ho Jang; Lincoln Stein; Quaid Morris
Journal: Genome Biol Date: 2015-02-13 Impact factor: 13.583

7. Sequencing error profiles of Illumina sequencing instruments.

Authors: Nicholas Stoler; Anton Nekrutenko
Journal: NAR Genom Bioinform Date: 2021-03-27

Review 8. Cancer genomics: one cell at a time.

Authors: Nicholas E Navin
Journal: Genome Biol Date: 2014-08-30 Impact factor: 13.583