Monica Valecha, David Posada.
Abstract
Single-cell sequencing has gained popularity in recent years. Despite its numerous applications, single-cell DNA sequencing data is highly error-prone due to technical biases arising from uneven sequencing coverage, allelic dropout, and amplification error. With these artifacts, the identification of somatic genomic variants becomes a challenging task, and over the years, several methods have been developed explicitly for this type of data. Single-cell variant callers implement distinct strategies, make different use of the data, and typically result in many discordant calls when applied to real data. Here, we review current approaches for single-cell variant calling, emphasizing single nucleotide variants. We highlight their potential benefits and shortcomings to help users choose a suitable tool for the data at hand.
Keywords: ADO, allelic dropout; Allele dropout; Amplification error; CNV, copy number variant; Indel, short insertion or deletion; LDO, locus dropout; SNV, single nucleotide variant; SV, structural variant; Single-cell genomics; Somatic variants; VAF, variant allele frequency; Variant calling; hSNP, heterozygous single-nucleotide polymorphism; scATAC-seq, single-cell sequencing assay for transposase-accessible chromatin; scDNA-seq, single-cell DNA sequencing; scHi-C, single-cell Hi-C sequencing; scMethyl-seq, single-cell Methylation sequencing; scRNA-seq, single-cell RNA sequencing; scWGA, single-cell whole-genome amplification
Year: 2022 PMID: 35782734 PMCID: PMC9218383 DOI: 10.1016/j.csbj.2022.06.013
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1. Single-cell whole-genome amplification biases. Technical biases arising during single-cell whole-genome amplification can be detected from the sequencing reads, such as allele dropout (ADO) (1), allelic imbalance (AI) (2), locus dropout (LDO) (3), and amplification errors (4). Coverage breadth and depth are more heterogeneous in single-cell than in bulk sequencing.
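
As a rough illustration of how the first three biases in Fig. 1 show up in read counts, the sketch below labels a known germline heterozygous site from its reference and alternate read counts. The function name, the `min_depth` and `imbalance_margin` parameters, and their default values are hypothetical choices made for this example; they are not taken from any of the reviewed callers, which rely on probabilistic models rather than fixed cutoffs. Amplification errors (4) are not covered here.

```python
def classify_het_site(ref_reads: int, alt_reads: int,
                      min_depth: int = 1,
                      imbalance_margin: float = 0.2) -> str:
    """Label a known heterozygous site from its allele read counts.

    All thresholds are illustrative assumptions, not published values.
    """
    depth = ref_reads + alt_reads
    if depth < min_depth:
        # Locus dropout: neither allele was amplified or sequenced.
        return "LDO"
    if ref_reads == 0 or alt_reads == 0:
        # Allele dropout: only one of the two alleles is observed.
        return "ADO"
    vaf = alt_reads / depth  # variant allele frequency at this site
    if abs(vaf - 0.5) > imbalance_margin:
        # Allelic imbalance: both alleles present, but far from 50/50.
        return "AI"
    return "balanced"


if __name__ == "__main__":
    for counts in [(0, 0), (30, 0), (27, 3), (14, 16)]:
        print(counts, classify_het_site(*counts))
```

At low depth, a missing allele may simply reflect sampling noise, so a single-site label like this is only suggestive.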
Methodological strategies and assumptions of scDNA-seq variant callers. Tools can use all cells simultaneously (joint calling) or do mutation calling cell by cell (marginal calling). Some callers use phylogenetic information or follow the infinite-sites assumption. The allelic imbalance and the amplification error are assumed to be constant across the genome (global) or not (local). Some tools use linked hSNPs to identify errors.
| Caller | Calling | Phylogeny | Infinite-sites assumption | Allelic imbalance/dropout | Amplification error | Linked hSNPs |
|---|---|---|---|---|---|---|
| Monovar | joint | no | no | global | global | no |
| SCcaller | marginal | no | no | local | global | no |
| SCIΦ | joint | yes | yes | global | global | no |
| LiRA | marginal | no | no | local | local | yes |
| Conbase | joint | no | no | local | local | yes |
| SCAN-SNV | joint | no | no | local | local | no |
| scVILP | joint | yes | yes | global | global | no |
| ProSolo | marginal | no | no | local | local | no |
| SCIΦN | joint | yes | no | global | global | no |
| Phylovar | joint | yes | yes | global | global | no |
Capabilities of scDNA-seq variant callers. All callers identify somatic variants, whereas some also include germline variants and indels in their output. Some callers can also give homozygous mutant genotypes and impute missing genotypes. Most callers can call singletons (mutations that appear just in one cell), and SCAN-SNV also detects doublets (pairs of cells erroneously treated as a single cell).
| Caller | Germline calls | Somatic calls | Indels | Homozygous mutations | Genotype imputation | Calls singletons | Detects doublets |
|---|---|---|---|---|---|---|---|
| Monovar | yes | yes | no | yes | no | yes | no |
| SCcaller | no | yes | yes | yes | no | yes | no |
| SCIΦ | no | yes | no | no | yes | yes | no |
| LiRA | no | yes | no | no | no | yes | no |
| Conbase | no | yes | no | yes | no | no | no |
| SCAN-SNV | no | yes | no | yes | no | yes | yes |
| scVILP | yes | yes | no | no | yes | yes | no |
| ProSolo | yes | yes | no | yes | yes | yes | no |
| SCIΦN | no | yes | no | no | yes | yes | no |
| Phylovar | yes | yes | no | no | no | yes | no |
Technical features of scDNA-seq variant callers. The input format can be BAM (https://samtools.github.io/hts-specs/SAMv1.pdf) or mpileup (http://www.htslib.org/doc/samtools-mpileup.html). All tools require a human reference genome, and some also need a set of candidate SNVs or SNPs, normal/tumor bulk samples, or a dbSNP file (https://www.ncbi.nlm.nih.gov/snp). The output format is VCF (https://samtools.github.io/hts-specs/VCFv4.2.pdf), its binary counterpart BCF, TSV (tab-separated values), or RDA (R data file).
| Caller | Input format | Other input files | Bulk sample | dbSNP | Output | Computer language |
|---|---|---|---|---|---|---|
| Monovar | BAM | Ref. genome | no | no | VCF | Python |
| SCcaller | BAM | Ref. genome | normal | yes | VCF | Python |
| SCIΦ | Mpileup | Ref. genome | normal | no | VCF | C++ |
| LiRA | BAM | Ref. genome, candidate SNVs | normal | yes | VCF | Python, R |
| Conbase | BAM | Ref. genome, SNPs | normal | no | TSV | Python |
| SCAN-SNV | BAM | Ref. genome | normal | yes | RDA | Python, R |
| scVILP | Mpileup | Ref. genome | no | no | VCF | Python, C++ |
| ProSolo | BAM | Ref. genome, candidate SNVs | tumor | no | BCF | Python |
| SCIΦN | Mpileup | Ref. genome | normal | no | VCF | C++ |
| Phylovar | Mpileup | Ref. genome | no | no | VCF | Python |
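
To show how the technical requirements above can narrow down the choice of tool, the following sketch encodes four columns of the table (input format, bulk sample, dbSNP, output) and filters callers by the inputs a user has available. The `CALLERS` dictionary and the `select` helper are hypothetical names introduced for this example; the values are transcribed from the table, with `None` standing for "no bulk sample required".

```python
# Columns transcribed from the technical-features table above.
CALLERS = {
    "Monovar":  {"input": "BAM",     "bulk": None,     "dbsnp": False, "output": "VCF"},
    "SCcaller": {"input": "BAM",     "bulk": "normal", "dbsnp": True,  "output": "VCF"},
    "SCIΦ":     {"input": "Mpileup", "bulk": "normal", "dbsnp": False, "output": "VCF"},
    "LiRA":     {"input": "BAM",     "bulk": "normal", "dbsnp": True,  "output": "VCF"},
    "Conbase":  {"input": "BAM",     "bulk": "normal", "dbsnp": False, "output": "TSV"},
    "SCAN-SNV": {"input": "BAM",     "bulk": "normal", "dbsnp": True,  "output": "RDA"},
    "scVILP":   {"input": "Mpileup", "bulk": None,     "dbsnp": False, "output": "VCF"},
    "ProSolo":  {"input": "BAM",     "bulk": "tumor",  "dbsnp": False, "output": "BCF"},
    "SCIΦN":    {"input": "Mpileup", "bulk": "normal", "dbsnp": False, "output": "VCF"},
    "Phylovar": {"input": "Mpileup", "bulk": None,     "dbsnp": False, "output": "VCF"},
}


def select(**criteria):
    """Return caller names (in table order) whose attributes match all criteria."""
    return [
        name
        for name, features in CALLERS.items()
        if all(features[key] == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    # Callers usable when no bulk sample is available.
    print(select(bulk=None))
    # Callers that read BAM directly and write VCF.
    print(select(input="BAM", output="VCF"))
```

A lookup like this only captures input and output requirements; the methodological assumptions and capabilities summarized in the other tables still need to be weighed separately.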
References and URLs for scDNA-seq variant callers.
| Caller | Reference | URL |
|---|---|---|
| Monovar | Zafar et al., 2016 | |
| SCcaller | Dong et al., 2017 | |
| SCIΦ | Singer et al., 2018 | |
| LiRA | Bohrson et al., 2019 | |
| Conbase | Hård et al., 2019 | |
| SCAN-SNV | Luquette et al., 2019 | |
| scVILP | Edrisi et al., 2019 | |
| ProSolo | Lähnemann et al., 2021 | |
| SCIΦN | Kuipers et al., 2022 | |
| Phylovar | Edrisi et al., 2022 | |
Fig. 2. SNV assessment with linked hSNPs. Variant alleles at SNVs and linked hSNPs should appear consistently in the same reads if they occur on the same chromosome in the original cell genome (cis configuration). Conversely, they should appear consistently in different reads if they occur on different chromosomes in the original cell genome (trans configuration). Errors that arise during cell lysis or scWGA in cis with the hSNP variant allele will appear only on a fraction of the reads that carry the hSNP alternate allele. In contrast, errors in trans with the hSNP variant allele will appear only on a fraction of the reads that do not carry the hSNP alternate allele. ADO becomes evident when all or none of the linked reads carry the hSNP alternate allele.
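
The read-phasing logic described in Fig. 2 can be sketched as a small decision procedure. This is a deliberately simplified, deterministic illustration: the function name `assess_snv`, the input representation (one `(carries_hsnp_alt, carries_snv_alt)` pair per read spanning both sites), and the returned labels are assumptions made for this example. Callers that use linked hSNPs apply statistical models and tolerate sequencing noise rather than requiring exact agreement.

```python
from collections import Counter
from typing import Iterable, Tuple


def assess_snv(linked_reads: Iterable[Tuple[bool, bool]]) -> str:
    """Assess a candidate SNV from reads that also cover a linked hSNP.

    Each read is a pair: (carries hSNP alt allele, carries SNV alt allele).
    """
    counts = Counter(linked_reads)
    if not counts:
        return "no linked reads"

    hsnp_alt = counts[(True, True)] + counts[(True, False)]
    hsnp_ref = counts[(False, True)] + counts[(False, False)]
    if hsnp_alt == 0 or hsnp_ref == 0:
        # All or none of the reads carry the hSNP alt allele.
        return "ADO: only one haplotype observed"

    cis_support = counts[(True, True)]     # SNV alt on the hSNP-alt haplotype
    trans_support = counts[(False, True)]  # SNV alt on the hSNP-ref haplotype
    if cis_support == 0 and trans_support == 0:
        return "no variant support"
    if cis_support and trans_support:
        return "discordant: variant allele on both haplotypes"
    if cis_support:
        # A true cis variant is on every read of that haplotype; an error on a fraction.
        return "consistent (cis)" if cis_support == hsnp_alt else "likely error (cis)"
    return "consistent (trans)" if trans_support == hsnp_ref else "likely error (trans)"


if __name__ == "__main__":
    true_cis = [(True, True)] * 5 + [(False, False)] * 5
    cis_error = [(True, True)] * 2 + [(True, False)] * 3 + [(False, False)] * 5
    print(assess_snv(true_cis))
    print(assess_snv(cis_error))
```

Requiring the variant allele on every read of a haplotype is the strictest possible reading of "consistently"; a practical implementation would allow for a small sequencing error rate.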
Fig. 3. Runtime for single-cell SNV callers. Plot showing run times for scDNA-seq variant callers on a dataset with 24 single-cell whole genomes. Colors highlight distinct callers. The X-axis represents four different job-splitting strategies (note that different tools have different capabilities in this regard). The Y-axis is in log scale and represents the maximum number of hours required by a given tool.