Nicola Prezza, Nadia Pisanti, Marinella Sciortino, Giovanna Rosone.
Abstract
BACKGROUND: Sequencing technologies keep getting cheaper and faster, putting growing pressure on data structures designed to store raw data efficiently and, possibly, to perform analysis on it. In this context, there is growing interest in alignment-free and reference-free variant calling methods that use only (suitably indexed) raw read data.
Keywords: Assembly-free; BWT; LCP array; Reference-free; SNPs
Year: 2019 PMID: 30839919 PMCID: PMC6364478 DOI: 10.1186/s13015-019-0137-8
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Fig. 1 Our complete pipeline, including pre-processing and post-processing phases
Pre-processing comparative results of ebwt2snp (i.e. building the eBWT using either eGSA or BCR) and DiscoSnp++ (i.e. building the de Bruijn graph)
| Dataset | Coverage per sample | #reads | Preprocessing | Wall clock^a (h:mm:ss) | RAM |
|---|---|---|---|---|---|
| HG00096 (ch. 22) | 29× | 15,000,000 | gsacak | 0:53:34 | 100,607 |
| | | | eGSA | 1:41:37 | 30,720 |
| | | | BCR | 4:18:00 | 1,970 |
| | | | DiscoSnp++ | 0:01:09 | 5,170 |
| HG00100 (ch. 16) | 22× | 20,000,000 | gsacak | 1:13:04 | 112,641 |
| | | | eGSA | 3:39:04 | 30,720 |
| | | | BCR | 6:10:28 | 3,262 |
| | | | DiscoSnp++ | 0:02:01 | 6,111 |
| HG00419+NA19017 (ch. 1) | 43×–47× | 93,657,983 | BCR | 105:28:30 | 73,977 |
| | | | DiscoSnp++ | 0:32:37 | 621 |
Wall clock is the elapsed time from start to completion of the instance, while RAM is the peak Resident Set Size (RSS). Both values were measured with the /usr/bin/time command. Note that for the last collection we used a variant of BCR that keeps the in internal memory. eGSA and gsacak were not tested on the last dataset because they required too much disk space and RAM, respectively.
^a Recall that DiscoSnp++ uses multiple cores, whereas ebwt2snp is currently designed to use only one core, which explains the difference in speed.
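The two quantities reported in the table, elapsed wall-clock time and peak RSS, can also be collected programmatically. The sketch below is not the /usr/bin/time command used for the measurements above; it is an analogous measurement using Python's standard library on a Unix-like system. The helper name `measure` and the trivial command it runs are illustrative assumptions.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd and return (wall_seconds, peak_rss) for child processes.

    Hypothetical helper, Unix-only. ru_maxrss is reported in KiB on Linux
    and in bytes on macOS, so convert before comparing across platforms.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    # RUSAGE_CHILDREN gives the largest peak RSS among all child
    # processes that have terminated and been waited for so far.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return wall, peak


if __name__ == "__main__":
    # A do-nothing child process stands in for a real pipeline step.
    wall, peak = measure([sys.executable, "-c", "pass"])
    print(f"wall clock: {wall:.3f} s, peak RSS: {peak} (platform units)")
```

Because `RUSAGE_CHILDREN` accumulates a maximum over every child the calling process has waited for, each tool is best measured from a fresh interpreter.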
Post-processing comparative results of ebwt2snp (i.e. building clusters from the eBWT and performing SNP calling) and DiscoSnp++ (i.e. running KisSNP2 and kissreads2 using the pre-computed de Bruijn graph)
| Tool | Param. | Wall clock (mm:ss) | RAM | TP | FP | FN | SEN (%) | PREC (%) | Non-isol. |
|---|---|---|---|---|---|---|---|---|---|
| Individual HG00096 vs reference (chromosome 22, 50,818,468 bp), coverage 29× per sample | | | | | | | | | |
| DiscoSnp++ | b = 0 | 5:07 | 101 | 32,773 | 3719 | 13,274 | 71.17 | 89.81 | 4707/8658 |
| | b = 1 | 16:39 | 124 | 37,155 | 10,599 | 8892 | 80.69 | 77.80 | 5770/8658 |
| | b = 2 | 20:42 | 551 | 40,177 | 58,227 | 5870 | 87.25 | 40.83 | 6325/8658 |
| ebwt2snp | | 35:56 | 314 | 42,309 | 1487 | 3738 | 91.88 | 96.60 | 7233/8658 |
| | | 22:19 | 300 | 40,741 | 357 | 5306 | 88.47 | 99.13 | 6884/8658 |
| Individual HG00100 vs reference (chromosome 16, 90,338,345 bp), coverage 22× per sample | | | | | | | | | |
| DiscoSnp++ | b = 0 | 6:20 | 200 | 48,119 | 10,226 | 18,001 | 72.78 | 82.47 | 6625/11,055 |
| | b = 1 | 31:57 | 208 | 53,456 | 24,696 | 12,664 | 80.85 | 68.40 | 7637/11,055 |
| | b = 2 | 51:45 | 1256 | 57,767 | 124,429 | 8353 | 87.37 | 31.71 | 8307/11,055 |
| ebwt2snp | | 33:24 | 418 | 59,668 | 898 | 6452 | 90.24 | 98.51 | 9287/11,055 |
| | | 44:53 | 337 | 53,749 | 190 | 12,371 | 81.29 | 99.64 | 8169/11,055 |
Wall clock (mm:ss) is the elapsed time from start to completion of the instance, while RAM is the peak Resident Set Size (RSS). Both values were measured with the /usr/bin/time command. Recall that DiscoSnp++ uses multiple cores, whereas ebwt2snp is currently designed to use only one core, which explains the difference in speed.
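The SEN and PREC columns follow from the TP, FP and FN counts. The sketch below assumes the standard definitions, SEN = TP / (TP + FN) and PREC = TP / (TP + FP), which the source does not spell out but which reproduce the reported values; the function names are illustrative.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Percentage of ground-truth SNPs that were called: TP / (TP + FN)."""
    return 100.0 * tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Percentage of called SNPs that are in the ground truth: TP / (TP + FP)."""
    return 100.0 * tp / (tp + fp)


if __name__ == "__main__":
    # Counts from the DiscoSnp++ b = 0 row for HG00096 (chromosome 22).
    tp, fp, fn = 32_773, 3_719, 13_274
    print(f"SEN  = {sensitivity(tp, fn):.2f}%")   # SEN  = 71.17%
    print(f"PREC = {precision(tp, fp):.2f}%")     # PREC = 89.81%
```

The same two functions applied to the b = 0 row for HG00100 (TP 48,119, FP 10,226, FN 18,001) give 72.78 and 82.47, matching the table.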
Sensitivity and precision of the ebwt2snp pipeline
| Cov | SEN (%) | PREC (%) | TP | FP | FN | Non-isol. | Non-isol. (%) |
|---|---|---|---|---|---|---|---|
| 3 | 84.34 | 45.19 | 317,490 | 385,060 | 58,938 | 45,363 | 65.34 |
| 4 | 83.18 | 50.67 | 313,131 | 304,811 | 63,297 | 44,491 | 64.08 |
| 5 | 80.53 | 60.36 | 303,130 | 199,042 | 73,298 | 42,394 | 61.06 |
| 6 | 77.94 | 66.62 | 293,385 | 146,972 | 83,043 | 40,403 | 58.20 |
| 7 | 75.22 | 70.93 | 283,145 | 116,042 | 93,283 | 38,405 | 55.32 |
| 8 | 72.32 | 73.99 | 272,223 | 95,675 | 104,205 | 36,427 | 52.47 |
| 9 | 69.18 | 76.33 | 260,405 | 80,746 | 116,023 | 34,391 | 49.54 |
| 10 | 65.80 | 78.16 | 247,685 | 69,203 | 128,743 | 32,281 | 46.50 |
| 11 | 59.83 | 79.82 | 225,232 | 56,929 | 151,196 | 28,846 | 41.55 |
| 12 | 55.45 | 81.21 | 208,725 | 48,284 | 167,703 | 26,360 | 37.97 |
Values are computed using the SNPs predicted by a classic aligner-based pipeline as ground truth.
Sensitivity and precision of the DiscoSnp++ pipeline
| b | Wall clock (hh:mm:ss) | RAM (MB) | SEN (%) | PREC (%) | TP | FP | FN | Non-isol. | Non-isol. (%) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 00:42:46 | 608 | 62.62 | 87.21 | 235,749 | 34,547 | 140,679 | 18,561 | 26.73 |
| 1 | 02:13:23 | 866 | 71.94 | 74.57 | 270,811 | 92,310 | 105,617 | 31,640 | 45.57 |
| 2 | 11:09:09 | 13,830 | 77.31 | 45.41 | 291,022 | 349,754 | 85,406 | 34,840 | 50.18 |
Values are computed using the SNPs predicted by a classic aligner-based pipeline as ground truth.