Idoia Ochoa, Himanshu Asnani, Dinesh Bharadia, Mainak Chowdhury, Tsachy Weissman, Golan Yona.
Abstract
BACKGROUND: Next Generation Sequencing technologies have revolutionized many fields in biology by reducing the time and cost required for sequencing. As a result, large amounts of sequencing data are being generated. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users. This data consists of both the nucleotide sequences and per-base quality scores that indicate the level of confidence in the readout of these sequences. Quality scores account for about half of the required disk space in the commonly used FASTQ format (before compression), and therefore the compression of the quality scores can significantly reduce storage requirements and speed up analysis and transmission of sequencing data.
Year: 2013 PMID: 23758828 PMCID: PMC3698011 DOI: 10.1186/1471-2105-14-187
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
MSE results of QualComp when applied to the PhiX dataset
| R | C | MSE |
|---|---|---|
| 0 | 1 | 32.71 |
| 0 | 3 | 18.62 |
| 0 | 5 | 15.13 |
| 0.2 | 1 | 10.67 |
| 0.2 | 3 | 8.75 |
| 0.2 | 5 | 8.37 |
| 0.5 | 1 | 7.23 |
| 0.5 | 3 | 5.94 |
| 0.5 | 5 | 5.70 |
| 1.0 | 1 | 4.49 |
| 1.0 | 3 | 3.63 |
| 1.0 | 5 | 3.47 |
| 2.0 | 1 | 2.05 |
| 2.0 | 3 | 1.62 |
| 2.0 | 5 | 1.54 |
| 2.5 | 1 | 1.42 |
| 2.5 | 3 | 1.12 |
| 2.5 | 5 | 1.06 |
| 3.0 | 1 | 1.03 |
| 3.0 | 3 | 0.89 |
| 3.0 | 5 | 0.83 |
MSE obtained by our lossy compression algorithm for different rates (R) and number of clusters (C) with the PhiX dataset. As can be observed, increasing the number of clusters decreases the MSE for the same rate.
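The MSE values in the table compare the original quality scores with the reconstructed ones. As a minimal sketch, assuming the MSE is the squared difference averaged over every quality score in the file (the normalization is an assumption, and `quality_mse` is a hypothetical helper, not part of QualComp), it could be computed as follows:

```python
def quality_mse(original, reconstructed):
    """Mean squared error between two sets of quality scores.

    Each argument is a list of reads, and each read is a list of integer
    quality scores. Assumption: the error is averaged over all scores.
    """
    total_sq_err = 0.0
    n_scores = 0
    for orig_read, rec_read in zip(original, reconstructed):
        if len(orig_read) != len(rec_read):
            raise ValueError("reads must have the same length")
        for q, q_hat in zip(orig_read, rec_read):
            total_sq_err += (q - q_hat) ** 2
            n_scores += 1
    if n_scores == 0:
        raise ValueError("no quality scores given")
    return total_sq_err / n_scores


# Toy example with two reads of three scores each (made-up values).
orig = [[40, 38, 30], [35, 20, 10]]
rec = [[40, 36, 30], [34, 20, 13]]
print(quality_mse(orig, rec))
```

Under this definition, a lossless reconstruction gives an MSE of exactly 0, which is consistent with the zero entries reported for the lossless settings of the other tools.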
Figure 1. MSE vs. compression rate for the PhiX dataset. Results of QualComp when applied to the PhiX dataset for rates R = {0, 0.2, 0.5, 1, 2, 2.5, 3}, and 1, 2 and 3 clusters. As can be observed, increasing the number of clusters improves the performance of QualComp in terms of the MSE.
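Rate and MSE obtained by running the algorithm proposed in [39] on the PhiX dataset
| Parameter | Rate | MSE | | Parameter | Rate | MSE | | Parameter | Rate | MSE |
|---|---|---|---|---|---|---|---|---|---|---|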
| 60 | 0 | 836.54 | | 2 | 0.08 | 629.25 | | 33 | 0.30 | 189.63 |
| 32 | 0.78 | 352.20 | | 4 | 0.10 | 493.59 | | 40 | 0.35 | 165.27 |
| 30 | 0.76 | 207.50 | | 6 | 0.11 | 452.24 | | 60 | 0.42 | 142.76 |
| 25 | 0.63 | 102.14 | | 10 | 0.15 | 339.58 | | 70 | 0.50 | 122.08 |
| 20 | 0.41 | 118.67 | | 20 | 0.22 | 243.96 | | 80 | 0.59 | 103.19 |
| 15 | 0.9 | 39.86 | | 30 | 0.26 | 215.86 | | 90 | 0.59 | 103.19 |
| 10 | 1.09 | 17.67 | | 60 | 0.42 | 142.76 | | | | |
| 6 | 1.36 | 8.13 | | 70 | 0.50 | 122.08 | | | | |
| 4 | 1.90 | 2.92 | | 80 | 0.59 | 103.19 | | | | |
| 2 | 2.74 | 0.54 | | 90 | 0.59 | 103.19 | | | | |
MSE obtained by the LogBinning, UniBinning and Truncating schemes proposed in [39] with different parameters.
Rate and MSE obtained by running SCALCE [35] on the PhiX dataset
| Error threshold | Rate | MSE |
|---|---|---|
| 60 | 0.84 | 7.87 |
| 70 | 0.77 | 6.06 |
| 80 | 0.77 | 6.06 |
| 90 | 1.02 | 5.99 |
| 100 | 0.96 | 5.59 |
| 40 | 1.49 | 3.03 |
| 20 | 2.25 | 0.55 |
| 0 | 2.95 | 0 |
MSE obtained by SCALCE with different error thresholds.
Figure 2. Comparison of the MSE of different compression methods on the PhiX dataset. Comparison between the MSE obtained by QualComp and the schemes proposed in [39] and SCALCE [35] for different rates. Note that for small rates QualComp presents the smallest MSE, and it achieves rates not attainable by other lossy compression algorithms.
MSE results of QualComp when applied to the M. musculus dataset
| R | C | R’ | MSE |
|---|---|---|---|
| 0 | 1 | 0 | 143.0 |
| 0 | 2 | 0 | 37.58 |
| 0 | 3 | 0 | 27.49 |
| 0.33 | 1 | 0.3343 | 16.46 |
| 0.33 | 2 | 0.3420 | 14.39 |
| 0.33 | 3 | 0.3459 | 13.09 |
| 0.66 | 1 | 0.6685 | 12.94 |
| 0.66 | 2 | 0.6811 | 11.00 |
| 0.66 | 3 | 0.6825 | 9.82 |
| 1.00 | 1 | 1.0026 | 10.01 |
| 1.00 | 2 | 1.0153 | 8.59 |
| 1.00 | 3 | 1.0278 | 7.35 |
| 2.00 | 1 | 2.0051 | 4.58 |
| 2.00 | 2 | 2.0177 | 3.76 |
| 2.00 | 3 | 2.0300 | 3.12 |
MSE obtained by our lossy compression algorithm for different rates (R) and number of clusters (C) on the M. musculus dataset. R’ denotes the actual rate obtained after compression. Note that the effective rate R’, while still close to R, grows with the number of clusters.
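Rate and MSE obtained by running the algorithm proposed in [39] on the M. musculus dataset
| Parameter | Rate | MSE | | Parameter | Rate | MSE | | Parameter | Rate | MSE |
|---|---|---|---|---|---|---|---|---|---|---|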
| 60 | 0 | 684.29 | | 5 | 0.25 | 405.14 | | 33 | 0.01 | 684.29 |
| 34 | 0.29 | 632.13 | | 10 | 0.35 | 279.63 | | 40 | 0.26 | 404.92 |
| 26 | 0.65 | 80.160 | | 17 | 0.41 | 226.08 | | 50 | 0.58 | 137.08 |
| 17 | 0.62 | 129.42 | | 26 | 0.47 | 178.60 | | 60 | 1.58 | 15.01 |
| 10 | 1.13 | 14.51 | | 34 | 0.51 | 157.10 | | 70 | 3.24 | 0.00 |
| 5 | 1.42 | 6.03 | | 60 | 0.61 | 118.58 | | | | |
| | | | | 70 | 0.67 | 101.53 | | | | |
| | | | | 80 | 0.74 | 85.92 | | | | |
| | | | | 90 | 0.74 | 85.92 | | | | |
| | | | | 100 | 0.79 | 71.82 | | | | |
| | | | | 200 | 1.08 | 37.53 | | | | |
MSE obtained by the LogBinning, UniBinning and Truncating schemes proposed in [39] for different parameters.
Rate and MSE obtained by running SCALCE [35] and fastqz [34] on the M. musculus dataset
| Parameter | Rate | MSE | Parameter | Rate | MSE |
|---|---|---|---|---|---|
| 90 | 0.80 | 7.47 | 70 | 0.05 | 596.15 |
| 80 | 0.80 | 7.47 | 50 | 0.05 | 596.15 |
| 100 | 0.93 | 6.62 | 40 | 0.05 | 596.15 |
| 60 | 0.96 | 5.36 | 30 | 0.67 | 234.36 |
| 40 | 1.29 | 3.31 | 20 | 0.46 | 76.19 |
| 20 | 1.94 | 1.18 | 10 | 1.04 | 17.83 |
| 0 | 2.65 | 0 | 5 | 1.33 | 3.53 |
| 1 | 2.59 | 0 | | | |
MSE obtained by SCALCE with different error thresholds and by fastqz with different quantization levels.
Figure 3. Comparison of the MSE of different compression methods on the M. musculus dataset. Comparison between the MSE obtained by QualComp and the schemes proposed in [39], SCALCE [35] and fastqz [34] for different rates. Note that QualComp presents the smallest MSE for small rates.
MSE results of QualComp when applied to the H. sapiens dataset
| R | C | MSE |
|---|---|---|
| 0 | 1 | 75.64 |
| 0 | 2 | 25.21 |
| 0 | 3 | 17.32 |
| 0.25 | 1 | 12.55 |
| 0.25 | 2 | 8.53 |
| 0.25 | 3 | 7.26 |
| 0.50 | 1 | 9.18 |
| 0.50 | 2 | 7.17 |
| 0.50 | 3 | 5.90 |
| 1.00 | 1 | 6.53 |
| 1.00 | 2 | 5.42 |
| 1.00 | 3 | 4.16 |
| 2.00 | 1 | 3.50 |
| 2.00 | 2 | 3.02 |
| 2.00 | 3 | 1.99 |
MSE obtained by our lossy compression algorithm for different rates (R) and number of clusters (C) with the H. sapiens dataset.
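Rate and MSE obtained by running the schemes proposed in [39] on the H. sapiens dataset
| Parameter | Rate | MSE | Parameter | Rate | MSE | Parameter | Rate | MSE |
|---|---|---|---|---|---|---|---|---|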
| 60 | 0 | 1346.35 | 5 | 0.05 | 895.92 | 33 | 0.01 | 1346.35 |
| 40 | 0.72 | 684.87 | 10 | 0.07 | 680.27 | 44 | 0.05 | 895.87 |
| 30 | 0.31 | 99.59 | 20 | 0.10 | 538.43 | 60 | 0.39 | 119.04 |
| 20 | 0.84 | 143.35 | 30 | 0.11 | 494.93 | | | |
| 10 | 1.07 | 27.80 | 40 | 0.13 | 413.58 | | | |
| 5 | 1.44 | 4.95 | 60 | 0.15 | 375.73 | | | |
| | | | 70 | 0.17 | 339.76 | | | |
| | | | 80 | 0.19 | 305.66 | | | |
| | | | 90 | 0.19 | 305.66 | | | |
MSE obtained by the LogBinning, UniBinning and Truncating algorithms proposed in [39] for different parameters.
Rate and MSE obtained by running SCALCE [35] and fastqz [34] on the H. sapiens dataset
| Parameter | Rate | MSE | Parameter | Rate | MSE |
|---|---|---|---|---|---|
| 100 | 0.5339 | 3.69 | 70 | 0.02 | 1208.02 |
| 90 | 0.64 | 2.58 | 50 | 0.02 | 1208.02 |
| 80 | 0.64 | 2.58 | 40 | 0.02 | 1208.02 |
| 60 | 0.85 | 1.67 | 20 | 0.10 | 246.37 |
| 40 | 0.98 | 1.18 | 30 | 0.28 | 87.99 |
| 20 | 1.55 | 0.96 | 10 | 0.36 | 43.86 |
| 0 | 2.04 | 0 | 5 | 0.82 | 6.31 |
| 1 | 2.04 | 0 | | | |
MSE obtained by SCALCE with different error thresholds and by fastqz with different quantization levels.
Figure 4. Comparison of the MSE of different compression methods on the H. sapiens dataset. Comparison between the MSE obtained by QualComp and the schemes proposed in [39], SCALCE [35] and fastqz [34] for different rates.
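Alignment accuracy on the PhiX dataset with and without compression
| Rate | 0 mismatches | 1 mismatch | 2 mismatches | 3 mismatches | More than 4 mismatches | Not mapped | Size |
|---|---|---|---|---|---|---|---|
| | | | | | | | |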
| 2.95 | 11315113 | 1179411 | 237852 | 89385 | 141300 | 347707 (2.61%) | 468.096 |
| | | | | | | | |
| 0 | 11315113 | 1178443 | 237493 | 67493 | 298 | 511928 (3.84%) | 0.097 |
| 0.20 | 11315113 | 1179059 | 237691 | 86662 | 90262 | 401981 (3.01%) | 32.097 |
| 0.50 | 11315113 | 1179153 | 237726 | 88051 | 100677 | 390048 (2.93%) | 80.097 |
| 1.00 | 11315113 | 1179233 | 237766 | 88771 | 109950 | 379935 (2.85%) | 159.097 |
| 2.00 | 11315113 | 1179304 | 237801 | 89177 | 120269 | 369104 (2.77%) | 318.097 |
| 2.50 | 11315113 | 1179318 | 237813 | 89250 | 123610 | 365664 (2.74%) | 397.097 |
| | | | | | | | |
| 0 | 11315113 | 1179104 | 237763 | 79908 | 100618 | 398262 (2.99%) | 0.285 |
| 0.20 | 11315113 | 1179221 | 237793 | 86486 | 120835 | 371320 (2.78%) | 32.411 |
| 0.50 | 11315113 | 1179268 | 237799 | 87857 | 124371 | 366360 (2.75%) | 81.185 |
| 1.00 | 11315113 | 1179298 | 237816 | 88621 | 128182 | 361738 (2.71%) | 159.985 |
| 2.00 | 11315113 | 1179346 | 237827 | 89108 | 132675 | 356699 (2.67%) | 318.585 |
| 2.50 | 11315113 | 1179362 | 237835 | 89204 | 134221 | 355033 (2.66%) | 398.385 |
| | | | | | | | |
| 0 | 11315113 | 1179057 | 237742 | 83060 | 110348 | 385448 (2.89%) | 0.476 |
| 0.20 | 11315113 | 1179239 | 237796 | 86437 | 121236 | 370947 (2.78%) | 32.551 |
| 0.50 | 11315113 | 1179283 | 237799 | 87858 | 124886 | 365829 (2.74%) | 80.606 |
| 1.00 | 11315113 | 1179321 | 237813 | 88664 | 128682 | 361175 (2.71%) | 160.376 |
| 2.00 | 11315113 | 1179363 | 237828 | 89146 | 133300 | 356018 (2.67%) | 319.270 |
| 2.50 | 11315113 | 1179364 | 237833 | 89230 | 134703 | 354525 (2.66%) | 400.176 |
Alignment results of Bowtie with the original PhiX FASTQ file and the ones reconstructed by QualComp, with different parameters. The first column specifies the rate. The following columns give the number of reads that map to the reference genome with 0, 1, 2, 3 and more than 4 mismatches, followed by the number of reads that did not map. The last column shows the total size after compression. To compute the size of the quality scores in the original FASTQ file, we apply SCALCE [35] with lossless compression. Note that the number of reads that map with zero mismatches remains constant for all choices of rate and number of clusters, and is equal to that of the original file.
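The percentage shown next to the unmapped reads can be reproduced from the counts in each row. The sketch below assumes that the five mismatch columns and the unmapped column together account for every read in the file (an assumption, though the rows of the table do add up to the same total), and `unmapped_fraction` is a hypothetical helper name:

```python
def unmapped_fraction(mapped_counts, unmapped):
    """Fraction of all reads that failed to map.

    mapped_counts: read counts per mismatch category (mapped reads only).
    unmapped: number of reads that did not map.
    Assumption: these categories partition the full set of reads.
    """
    total = sum(mapped_counts) + unmapped
    return unmapped / total


# Counts taken from the row for the original file (rate 2.95).
mapped = [11315113, 1179411, 237852, 89385, 141300]
unmapped = 347707
print(f"{100 * unmapped_fraction(mapped, unmapped):.2f}%")  # prints 2.61%
```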
SNP calling on the M. musculus dataset with and without compression
| R | MSE | T.P. | F.P. | F.N. | Selectivity (%) | Sensitivity (%) | Size |
|---|---|---|---|---|---|---|---|
| 1 cluster | | | | | | | |
| 0 | 143.0 | 11217 | 2033 | 1810 | 84.66 | 86.11 | 0.019 |
| 0.20 | 19.16 | 12585 | 1159 | 442 | 91.57 | 96.61 | 16.17 |
| 0.33 | 16.46 | 12602 | 1120 | 380 | 91.84 | 97.07 | 26.68 |
| 0.66 | 12.94 | 12669 | 998 | 358 | 92.70 | 97.25 | 53.34 |
| 1.00 | 10.01 | 12656 | 875 | 371 | 93.53 | 97.15 | 80.82 |
| 2.00 | 4.58 | 12733 | 594 | 294 | 95.54 | 97.74 | 161.62 |
| 2 clusters | | | | | | | |
| 0 | 37.58 | 12086 | 1534 | 941 | 88.73 | 92.77 | 0.039 |
| 0.20 | 16.42 | 12644 | 1184 | 383 | 91.44 | 97.06 | 16.19 |
| 0.33 | 14.39 | 12655 | 1107 | 372 | 91.95 | 97.14 | 26.70 |
| 0.66 | 11.00 | 12669 | 985 | 358 | 92.78 | 97.25 | 53.36 |
| 1.00 | 8.59 | 12687 | 830 | 340 | 93.85 | 97.39 | 80.84 |
| 2.00 | 3.76 | 12751 | 606 | 276 | 95.46 | 97.88 | 161.64 |
| 3 clusters | | | | | | | |
| 0 | 27.49 | 12048 | 1219 | 979 | 90.81 | 92.48 | 0.050 |
| 0.20 | 14.89 | 12638 | 1108 | 389 | 91.93 | 97.01 | 16.21 |
| 0.33 | 13.09 | 12645 | 1070 | 382 | 92.19 | 97.06 | 26.98 |
| 0.66 | 9.82 | 12646 | 909 | 381 | 93.29 | 97.07 | 53.91 |
| 1.00 | 7.35 | 12685 | 776 | 342 | 94.23 | 97.37 | 80.85 |
| 2.00 | 3.12 | 12730 | 554 | 297 | 95.83 | 97.72 | 161.65 |
We compare the SNPs detected by Samtools with the original FASTQ file and those obtained with the compressed files, using QualComp with one, two and three clusters and different rates. In all cases, reads were aligned first using the BWA algorithm. T.P., F.P. and F.N. stand for true positive (detected both with the original FASTQ file and the reconstructed one), false positive (detected only with the reconstructed FASTQ file) and false negative (detected only with the original FASTQ file), respectively. The selectivity parameter is computed as T.P./(T.P. + F.P.), and sensitivity as T.P./(T.P. + F.N.). Note that already for R = 0.2 the sensitivity is above 96% and the selectivity is above 91%.
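The two ratios defined in the caption can be checked directly against the table. This is a minimal sketch that reads the third, fourth and fifth columns of the first row (R = 0) as T.P., F.P. and F.N.; that column assignment is an assumption, and the function names are illustrative only:

```python
def selectivity(tp, fp):
    """T.P. / (T.P. + F.P.), as defined in the caption."""
    return tp / (tp + fp)


def sensitivity(tp, fn):
    """T.P. / (T.P. + F.N.), as defined in the caption."""
    return tp / (tp + fn)


# Counts from the first row of the table (R = 0).
tp, fp, fn = 11217, 2033, 1810
print(f"selectivity = {100 * selectivity(tp, fp):.2f}%")  # prints 84.66%
print(f"sensitivity = {100 * sensitivity(tp, fn):.2f}%")  # prints 86.11%
```

Both values agree with the sixth and seventh columns of that row, which supports reading those columns as selectivity and sensitivity in percent.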
SNP calling on the H. sapiens dataset with and without compression
| R | MSE | T.P. | F.P. | F.N. | Selectivity (%) | Sensitivity (%) | Size |
|---|---|---|---|---|---|---|---|
| 1 cluster | | | | | | | |
| 0 | 75.64 | 54945 | 11560 | 5482 | 82.62 | 90.93 | 0.027 |
| 0.20 | 13.95 | 58806 | 5952 | 1621 | 90.81 | 97.32 | 27.37 |
| 0.25 | 12.55 | 58881 | 5707 | 1546 | 91.16 | 97.44 | 34.20 |
| 0.50 | 9.18 | 59078 | 5022 | 1349 | 92.17 | 97.77 | 68.38 |
| 1.00 | 6.53 | 59349 | 4541 | 1078 | 92.89 | 98.22 | 136.74 |
| 2.00 | 3.50 | 59628 | 3814 | 799 | 93.99 | 98.68 | 273.45 |
| 2 clusters | | | | | | | |
| 0 | 25.21 | 51007 | 5010 | 9420 | 91.05 | 84.41 | 0.054 |
| 0.20 | 9.09 | 58955 | 4949 | 1472 | 92.25 | 97.56 | 27.39 |
| 0.25 | 8.53 | 59002 | 4951 | 1425 | 92.25 | 97.64 | 34.23 |
| 0.50 | 7.17 | 59188 | 4784 | 1239 | 92.52 | 97.94 | 68.41 |
| 1.00 | 5.42 | 59400 | 4559 | 1027 | 92.87 | 98.30 | 136.76 |
| 2.00 | 3.02 | 59601 | 3718 | 826 | 94.12 | 98.63 | 273.48 |
| 3 clusters | | | | | | | |
| 0 | 17.32 | 52922 | 4686 | 7505 | 91.87 | 87.58 | 0.082 |
| 0.20 | 7.80 | 58913 | 4823 | 1514 | 92.43 | 97.49 | 27.42 |
| 0.25 | 7.26 | 58977 | 4766 | 1450 | 92.52 | 97.60 | 34.26 |
| 0.50 | 5.90 | 59111 | 4411 | 1316 | 93.06 | 97.82 | 68.44 |
| 1.00 | 4.16 | 59247 | 4041 | 1180 | 93.61 | 98.05 | 136.79 |
| 2.00 | 1.99 | 59589 | 3262 | 838 | 94.81 | 98.61 | 273.51 |
We compare the SNPs detected by Samtools with the original FASTQ file and those obtained with the compressed files, using QualComp with one, two and three clusters and different rates. For more details see Table 11.