Nicola Prezza, Nadia Pisanti, Marinella Sciortino, Giovanna Rosone.
Abstract
BACKGROUND: In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was suggested and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves the sensitivity and precision of previous de Bruijn graph-based tools by overcoming several of their limitations, namely: (i) the need to fix a value, usually small, for the order k, (ii) the loss of important information such as k-mer coverage and the adjacency of k-mers within the same read, and (iii) poor performance in repeated regions longer than k bases. The preliminary tool, however, could identify only SNPs, and it was too slow and memory-consuming because it used additional heavy data structures (namely, the Suffix and LCP arrays) besides the BWT.
Keywords: Alignment-free; Assembly-free; BWT; INDEL; SNP
Year: 2020 PMID: 32938358 PMCID: PMC7493873 DOI: 10.1186/s12859-020-03586-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Strategy for finding SNPs/INDELs. (1) Underlying (unknown) genotype, including an INDEL. (2) Input reads sequenced from the genotype (including sequencing errors). (3) eBWT, LCP, and the contexts preceding (LEFT) and following (RIGHT) each eBWT character. In bold: LCP minima. In gray: eBWT cluster. Note that we explicitly compute only the eBWT column (the other columns are shown for illustrative purposes only). LCP minima are computed on-the-fly, whereas the LEFT and RIGHT contexts are reconstructed using backward search and the FL mapping, respectively. (4) Output INDEL, extended by one nucleotide to the left and two to the right. Note that the output INDEL is left-shifted, whereas originally (in the unknown genotype) it was right-shifted. To call the INDEL, we (i) compute (via backward search) the two consensus sequences AT and ATGC of the two alleles' left-contexts (i.e. the strings obtained by concatenating the symbols in LEFT and eBWT), and (ii) align them, possibly allowing an INDEL at their right end. In the figure, the best alignment is the one that deletes GC from ATGC. SNPs are computed similarly, the only difference being that the best alignment of the left-contexts introduces neither insertions nor deletions.
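The eBWT and LCP columns of panel 3 can be reproduced on a toy read set with a naive Python sketch. It is illustrative only: it assumes the end-marker variant of the eBWT (each read gets its own terminator, smaller than any nucleotide and ordered by read index), it sorts all suffixes explicitly, and it groups positions with a fixed LCP threshold `k` rather than with the LCP minima used in the figure. The function names `ebwt_and_lcp` and `clusters` are hypothetical and are not taken from the tool.

```python
def ebwt_and_lcp(reads):
    """Naive eBWT and LCP array of a read collection (quadratic, toy inputs only)."""
    suffixes = []
    for i, read in enumerate(reads):
        for j in range(len(read) + 1):
            # Key: nucleotides tagged with 1, then this read's terminator (0, i),
            # so terminators sort before nucleotides and differ between reads.
            key = [(1, c) for c in read[j:]] + [(0, i)]
            # The eBWT character is the symbol preceding the suffix
            # (the terminator, written "$", for the whole read).
            prev = read[j - 1] if j > 0 else "$"
            suffixes.append((key, prev))
    suffixes.sort(key=lambda s: s[0])
    ebwt = "".join(prev for _, prev in suffixes)
    lcp = [0]
    for (a, _), (b, _) in zip(suffixes, suffixes[1:]):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp.append(n)  # distinct terminators never match, so only nucleotides count
    return ebwt, lcp


def clusters(lcp, k):
    """Maximal ranges [start, end) of suffixes sharing a prefix of length >= k.

    Assumption: a fixed threshold k stands in for the LCP-minima criterion.
    """
    out, start = [], 0
    for i in range(1, len(lcp) + 1):
        if i == len(lcp) or lcp[i] < k:
            if i - start > 1:
                out.append((start, i))
            start = i
    return out
```

For the two toy reads `TAC` and `GAC`, the suffixes starting with `AC` form a single cluster at `k = 2`, and the eBWT characters in that range are `T` and `G`: two different symbols preceding the same right-context, which is the kind of signal a variant caller would inspect.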
Fig. 2 Simulated SNP detection. SNP sensitivity, precision and F1 score on synthetic data as a function of the dataset's coverage.
Fig. 3 Simulated INDEL detection. INDEL sensitivity, precision and F1 score on synthetic data as a function of the dataset's coverage.
Running times on real data (h:mm:ss)
| coverage | BCR (BWT) | | |
|---|---|---|---|
| 10 | 1:03:02 | 0:51:05 | 0:54:07 |
| 20 | 2:08:52 | 1:24:00 | 1:09:06 |
| 30 | 3:19:18 | 2:20:14 | 1:21:31 |
| 40 | 4:22:06 | 2:55:45 | 1:37:41 |
| 48 | 5:11:35 | 3:57:26 | 1:42:50 |
We also show the times required to build the BWT using the tool BCR. All tools were run on a single core.
Results on the 30x-covered real dataset
| metric | | |
|---|---|---|
| SEN SNP | 0.791231 | 0.641049 |
| PREC SNP | 0.596384 | 0.784806 |
| SEN INDEL | 0.547036 | 0.425699 |
| PREC INDEL | 0.533956 | 0.571847 |
| F1 SNP | 0.680127 | 0.705681 |
| F1 INDEL | 0.540417 | 0.488067 |
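The F1 rows above are consistent with the usual definition of F1 as the harmonic mean of sensitivity (SEN) and precision (PREC). A minimal Python check, assuming that definition; the helper name `f1_score` is ours and is not part of the tool:

```python
def f1_score(sensitivity, precision):
    """Harmonic mean of sensitivity (recall) and precision."""
    if sensitivity + precision == 0:
        return 0.0
    return 2 * sensitivity * precision / (sensitivity + precision)


# (SEN, PREC, reported F1) triples copied from the table, column by column.
rows = [
    (0.791231, 0.596384, 0.680127),  # SNP, first column
    (0.641049, 0.784806, 0.705681),  # SNP, second column
    (0.547036, 0.533956, 0.540417),  # INDEL, first column
    (0.425699, 0.571847, 0.488067),  # INDEL, second column
]

for sen, prec, reported in rows:
    # Agreement up to the rounding of the published six-digit values.
    assert abs(f1_score(sen, prec) - reported) < 1e-5
```

Because F1 penalizes an imbalance between the two inputs, the column with the higher SNP sensitivity but lower SNP precision ends up with the lower SNP F1, while for INDELs the same column has the higher F1.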