James K Bonfield, Shane A McCarthy, Richard Durbin.
Abstract
Motivation: The bulk of space taken up by NGS sequencing CRAM files consists of per-base quality values. Most of these are unnecessary for variant calling, offering an opportunity for space saving.
Year: 2019 PMID: 29992288 PMCID: PMC6330002 DOI: 10.1093/bioinformatics/bty608
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Effect of lossy quality compression on 50× and 15× Syndip data using GATK HaplotypeCaller
| Category | Original | Original F | Crumble-1 | Crumble-1 F | Crumble-9p8 | Crumble-9p8 F | Crumble* | Crumble* F |
|---|---|---|---|---|---|---|---|---|
| 50× Qual size (MB) | 4107 | — | 614 | — | 235 | — | 229 | — |
| 50× SNP False Positive | 6226 | 2968 | –359 | –79 | –251 | –67 | –526 | –181 |
| 50× SNP False Negative | 4648 | 7625 | 0 | –53 | –25 | –184 | +41 | –123 |
| 50× Indel False Positive | 3965 | 3649 | –7 | –41 | +19 | +9 | –35 | –32 |
| 50× Indel False Negative | 7881 | 7972 | +7 | +11 | –103 | –82 | –93 | –72 |
| 15× Qual size (MB) | 1211 | — | 260 | — | 77 | — | 72 | — |
| 15× SNP False Positive | 4798 | 2517 | –10 | +63 | +347 | +225 | –359 | –29 |
| 15× SNP False Negative | 14985 | 27761 | –205 | –297 | –3027 | –4608 | –1866 | –2865 |
| 15× Indel False Positive | 2781 | 2521 | +2 | –14 | +109 | +60 | +53 | +26 |
| 15× Indel False Negative | 13136 | 13925 | –8 | +5 | –484 | –427 | –444 | –410 |
Note: Comparison of unfiltered and filtered (marked with ‘F’) calls on the Syndip truth set. GATK filtering rules are listed in the Supplementary Material. Crumble* refers to parameters optimized for this dataset: ‘crumble -9p8 -u30 -Q60 -D100’. False positive and false negative counts in the Crumble columns are shown relative to the corresponding GATK calls on the lossless (Original) data; the Original columns give absolute counts. The truth set for Chromosome 1 has 269 655 SNPs and 46 036 indels, counting multi-allelic sites once per allele. Quality sizes are absolute for all files.
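
Because the Crumble columns report differences rather than absolute counts, reading the table takes a small amount of arithmetic. The following is a minimal sketch of that arithmetic, using the unfiltered 50× rows of the table as input. The helper names `compression_ratio` and `absolute_count` are hypothetical and introduced only for illustration; they are not part of Crumble or GATK.

```python
# Values copied from the 50x rows of the table (unfiltered columns).
# Quality sizes are absolute, in MB.
qual_size_mb = {
    "Original": 4107,
    "Crumble-1": 614,
    "Crumble-9p8": 235,
    "Crumble*": 229,
}

# 50x SNP false positives: absolute count for the lossless data,
# and deltas relative to it for each Crumble setting.
snp_fp_original = 6226
snp_fp_delta = {
    "Crumble-1": -359,
    "Crumble-9p8": -251,
    "Crumble*": -526,
}


def compression_ratio(original_mb: float, lossy_mb: float) -> float:
    """How many times smaller the lossy quality data is (hypothetical helper)."""
    return original_mb / lossy_mb


def absolute_count(baseline: int, delta: int) -> int:
    """Recover an absolute error count from a table delta (hypothetical helper)."""
    return baseline + delta


if __name__ == "__main__":
    for name, delta in snp_fp_delta.items():
        ratio = compression_ratio(qual_size_mb["Original"], qual_size_mb[name])
        count = absolute_count(snp_fp_original, delta)
        print(f"{name}: quality data {ratio:.1f}x smaller, {count} SNP false positives")
```

The same two helpers apply to any other row: substitute the Original value as the baseline and the Crumble entry as the delta.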