Yoshihiro Shibuya, Matteo Comin.
Abstract
MOTIVATION: Current NGS techniques are becoming exponentially cheaper. As a result, genomic data is growing exponentially, but storage capacity is not growing at the same pace, which makes compression necessary. Most of the entropy of NGS data lies in the quality values associated with each read. Those values are often more diversified than necessary. Because of that, many tools, such as Quartz or GeneCodeq, try to change (smooth) quality scores in order to improve compressibility without altering the important information they carry for downstream analyses like SNP calling.
Keywords: BWT; FASTQ compression; FM-Index
Year: 2019 PMID: 31757199 PMCID: PMC6873394 DOI: 10.1186/s12859-019-2883-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. Example of smoothing performed by YALFF. A mismatch in one of the k-mers is enough to keep the corresponding quality value unchanged
Fig. 2. The threshold causes k-mers 1 to 5 to be skipped by the algorithm
Fig. 3. Example showing the threshold mechanism introduced to trim the low quality bases of a read. In this case only two k-mers are queried
Fig. 4. An example of quality smoothing by YALFF including both mismatches with the k-mers DB and low quality values
Fig. 5. An overview of YALFF’s inner structure
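The behaviour described in the captions of Figs. 1 to 4 can be illustrated with a small Python sketch. It is not YALFF's implementation: the function name `smooth_qualities`, the use of a plain Python set as the k-mer DB, the rule for skipping k-mers that contain a low quality base, and the fixed replacement value are all assumptions made for illustration.

```python
def smooth_qualities(read, quals, kmer_db, k, low_threshold, smoothed_value):
    """Hypothetical sketch of k-mer based quality smoothing.

    Assumptions (not taken from the paper's text):
    - a k-mer containing any base with quality < low_threshold is skipped
      (never queried against the DB);
    - a base is smoothed only if it is covered by at least one queried k-mer
      and every queried k-mer covering it is found in the DB;
    - smoothing means replacing the quality with a fixed value.
    """
    n = len(read)
    matched = [False] * n   # covered by at least one k-mer found in the DB
    blocked = [False] * n   # covered by at least one queried k-mer missing from the DB
    for start in range(n - k + 1):
        window = range(start, start + k)
        if min(quals[i] for i in window) < low_threshold:
            continue  # low quality k-mer: skipped, not queried
        found = read[start:start + k] in kmer_db
        for i in window:
            if found:
                matched[i] = True
            else:
                blocked[i] = True
    return [
        smoothed_value if matched[i] and not blocked[i] else quals[i]
        for i in range(n)
    ]


# Toy example: the k-mer "GTAC" (positions 2 to 5) is missing from the DB,
# so the four quality values it covers stay unchanged.
toy_db = {"ACGT", "CGTA", "TACG"}
print(smooth_qualities("ACGTACGT", [30] * 8, toy_db, k=4,
                       low_threshold=6, smoothed_value=40))
# prints [40, 40, 30, 30, 30, 30, 40, 40]
```

A real tool would query a compressed index rather than a set (the keywords mention BWT and FM-Index), but the set keeps the sketch self-contained.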
Comparison of SNP calling metrics (T.P., F.P., F.N., Precision, Recall and F-Measure) between different tools
| Smoothing algorithm | T.P. | F.P. | F.N. | Precision | Recall | F-Measure |
|---|---|---|---|---|---|---|
| None (original files) | 2588159 | 219803 | 1493731 | 0.9217 | 0.6341 | 0.7513 |
| YALFF | 2603620 | 221368 | 1478264 | 0.9216 | 0.6378 | 0.7539 |
| Quartz | 2661218 | 237820 | 1420672 | 0.9180 | 0.6520 | 0.7624 |
| Leon | 2278517 | 204803 | 1803366 | 0.9175 | 0.5582 | 0.6941 |
| Illumina 8bin | 2546518 | 216128 | 1535370 | 0.9218 | 0.6239 | 0.7441 |
| Pblock p = 2 | 2574111 | 218405 | 1507773 | 0.9218 | 0.6306 | 0.7489 |
| Pblock p = 4 | 2558612 | 216995 | 1523273 | 0.9218 | 0.6268 | 0.7462 |
| Rblock t = 1.1 | 2550179 | 216738 | 1531706 | 0.9217 | 0.6248 | 0.7447 |
| Rblock t = 1.15 | 2526704 | 215721 | 1555181 | 0.9213 | 0.6190 | 0.7405 |
| QVZ 0.6 | 2588704 | 225730 | 1493180 | 0.9198 | 0.6342 | 0.7507 |
| QVZ 0.8 | 2588773 | 223210 | 1493112 | 0.9206 | 0.6342 | 0.7510 |
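The Precision, Recall and F-Measure columns follow from the T.P., F.P. and F.N. counts through the standard definitions. The short Python sketch below recomputes them for the "None (original files)" row; the function name `snp_metrics` is illustrative only.

```python
def snp_metrics(tp, fp, fn):
    """Standard precision, recall and F-measure from SNP calling counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts from the "None (original files)" row of the table above.
p, r, f = snp_metrics(tp=2588159, fp=219803, fn=1493731)
print(f"{p:.4f} {r:.4f} {f:.4f}")  # prints 0.9217 0.6341 0.7513
```

The same call reproduces the other rows, which makes it easy to check that a higher recall (for example Quartz) can come with slightly lower precision.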
Fig. 6. ROC curves of SNP calling for various methods
Fig. 7. Histogram showing the total execution time in hours and peak RAM usage of the different programs
Fig. 8. Time taken by YALFF to smooth a FASTQ file as a function of the number of cores
Compression ratio for the different smoothing tools and compressors. The ratio is defined as uncompressed size / compressed size, where the uncompressed size is 42 GB
| Smoothing algorithm | gzip | bzip2 | xz (LZMA) |
|---|---|---|---|
| None (original files) | 4.617 | 5.152 | 5.918 |
| YALFF | 7.147 | 7.633 | 9.186 |
| Quartz | 6.925 | 7.349 | 8.827 |
| Leon | 7.098 | 7.551 | 8.988 |
| Illumina 8bin | 6.054 | 6.742 | 7.819 |
| Pblock p = 2 | 5.373 | 5.966 | 7.011 |
| Pblock p = 4 | 6.052 | 6.647 | 7.671 |
| Rblock t = 1.1 | 6.285 | 6.859 | 7.941 |
| Rblock t = 1.15 | 6.675 | 7.250 | 8.443 |
| QVZ 0.6 | 4.776 | 5.533 | 6.395 |
| QVZ 0.8 | 4.778 | 5.510 | 6.366 |
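To make the effect of smoothing on the compression ratio concrete, the following Python sketch measures the ratio (uncompressed size divided by compressed size) with the standard-library `gzip`, `bz2` and `lzma` modules. The quality strings are synthetic toy data, not the FASTQ files used in the table, and the helper name `compression_ratio` is an assumption; the compressor settings used for the table are not known from this text, so library defaults are used.

```python
import bz2
import gzip
import lzma
import random


def compression_ratio(data: bytes, compress) -> float:
    """Uncompressed size divided by compressed size (higher is better)."""
    return len(data) / len(compress(data))


CODECS = {"gzip": gzip.compress, "bzip2": bz2.compress, "xz (LZMA)": lzma.compress}

# Toy data: a "noisy" quality string with many distinct Phred symbols,
# and a "smoothed" one where every value was replaced by a single symbol.
rng = random.Random(0)
phred_symbols = bytes(range(33, 74))  # '!' .. 'I'
noisy = bytes(rng.choice(phred_symbols) for _ in range(10_000))
smoothed = b"I" * len(noisy)

for name, compress in CODECS.items():
    print(name,
          round(compression_ratio(noisy, compress), 2),
          round(compression_ratio(smoothed, compress), 2))
```

The fully smoothed string is an extreme case; real smoothers leave some quality values untouched, so their gain sits between the two extremes.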
Impact of YALFF's parameters on various metrics: T.P., F.P., F.N., Precision, Recall, F-Measure, Compression (LZMA) and Time (min.)
| k | L.T. | H.T. | T.P. | F.P. | F.N. | Prec. | Recall | F-M. | Compr. | Time |
|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 6 | 40 | 2659170 | 276596 | 1422714 | 0.9058 | 0.6515 | 0.7579 | 10.107 | 11850 |
| 32 | 6 | 40 | 2603620 | 221368 | 1478264 | 0.9216 | 0.6378 | 0.7539 | 9.186 | 2934 |
| 48 | 6 | 40 | 2588254 | 220176 | 1493631 | 0.9216 | 0.6341 | 0.7513 | 5.936 | 509 |
| 32 | 0 | 40 | 2626957 | 253747 | 1454928 | 0.9119 | 0.6436 | 0.7546 | 9.181 | 2657 |
| 32 | 3 | 40 | 2603891 | 221696 | 1477993 | 0.9215 | 0.6379 | 0.7539 | 9.113 | 2315 |
| 32 | 12 | 40 | 2601463 | 219417 | 1480421 | 0.9222 | 0.6373 | 0.7537 | 8.813 | 2239 |
| 32 | 3 | 30 | 2616530 | 225716 | 1465356 | 0.9206 | 0.6410 | 0.7558 | 9.597 | 565 |
| 32 | 3 | 35 | 2612145 | 223372 | 1469743 | 0.9212 | 0.6399 | 0.7552 | 9.429 | 848 |
| 32 | 3 | 37 | 2609021 | 222494 | 1472866 | 0.9214 | 0.6392 | 0.7548 | 9.115 | 1279 |