Yi Niu, Mingming Ma, Fu Li, Xianming Liu, Guangming Shi.
Abstract
BACKGROUND: With the rapid development of high-throughput sequencing technology, the cost of whole-genome sequencing has dropped rapidly, leading to exponential growth in genome data. Efficiently compressing the DNA data generated by large-scale genome projects has become an important factor limiting the further development of the DNA sequencing industry. Although the compression of DNA bases has improved significantly in recent years, the compression of quality scores remains challenging.
Keywords: Adaptive coding order; High-throughput sequencing; Lossless compression; Quality score compression
Year: 2022; PMID: 35672665; PMCID: PMC9175485; DOI: 10.1186/s12859-022-04712-z
Source DB: PubMed; Journal: BMC Bioinformatics; ISSN: 1471-2105; Impact factor: 3.307
Fig. 1. Quality score distribution curve of ERR2438054
Fig. 2. Schematic diagram of the sequencing principle
Fig. 3. Distribution of quality scores in FASTQ
Fig. 4. Comparison of traditional scanning and ACO scanning: (a) traditional traversal method; (b) adaptive scan order
Fig. 5. Row-mean quantization scheme
Fig. 6. Composite context modeling strategy
Descriptions of the 6 FASTQ datasets used for evaluation

| Run ID | Sequencing platform | FASTQ size (bytes) | Read length | Quality size (bytes) |
|---|---|---|---|---|
| NA12878_2 | BGISEQ-500 | 134363357648 | 2*100 | 56983386200 |
| ERR2438054_1 | BGISEQ-500 | 133406591610 | 2*150 | 47097570000 |
| ERR174324_1 | Illumina HiSeq 2000 | 57800970448 | 2*101 | 22580690796 |
| ERR174331_1 | Illumina HiSeq 2000 | 57210954538 | 2*101 | 22350322320 |
| ERR174327_1 | Illumina HiSeq 2000 | 54724344869 | 2*101 | 21379957043 |
| ERR174324_2 | Illumina HiSeq 2000 | 57800970448 | 2*101 | 22580690796 |
Compression results of all algorithms on the NGS datasets
| Run ID | Ratio | gzip | 7z | gtz | quip | fqz-q1 | fqz-q2 | fqz-q3 | Spring | ACO |
|---|---|---|---|---|---|---|---|---|---|---|
| NA12878_2 | CR(%) | 48.55 | 49.66 | 38.47 | 38.48 | 39.08 | 38.59 | 38.35 | 39.68 | |
| | BPQ | 3.88 | 3.97 | 3.08 | 3.08 | 3.13 | 3.09 | 3.07 | 3.17 | |
| ERR2438054_1 | CR(%) | 46.23 | 47.11 | 37.09 | 36.52 | 37.08 | 36.71 | 37.63 | 37.07 | |
| | BPQ | 3.70 | 3.77 | 2.97 | 2.92 | 2.97 | 2.94 | 2.92 | 3.01 | |
| ERR174324_1 | CR(%) | 36.58 | 36.94 | 25.47 | 26.14 | 27.30 | 25.81 | 24.90 | 26.39 | |
| | BPQ | 2.93 | 2.96 | 2.04 | 2.09 | 2.18 | 2.06 | 1.99 | 2.11 | |
| ERR174331_1 | CR(%) | 36.55 | 36.91 | 25.45 | 26.11 | 27.27 | 25.77 | 24.86 | 26.37 | |
| | BPQ | 2.92 | 2.95 | 2.04 | 2.09 | 2.18 | 2.06 | 1.99 | 2.11 | |
| ERR174327_1 | CR(%) | 35.53 | 35.88 | 24.56 | 25.31 | 26.39 | 24.90 | 24.02 | 25.45 | |
| | BPQ | 2.84 | 2.87 | 1.96 | 2.02 | 2.11 | 1.99 | 1.92 | 2.04 | |
| ERR174324_2 | CR(%) | 38.47 | 38.81 | 27.07 | 27.52 | 28.89 | 27.37 | 26.35 | 28.03 | |
| | BPQ | 3.08 | 3.10 | 2.17 | 2.20 | 2.31 | 2.19 | 2.11 | 2.24 | |
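The CR(%) and BPQ rows in the results table appear to be two views of the same measurement. The sketch below is an illustration, not the authors' code. It assumes that CR is the compressed size divided by the original quality-score size, expressed in percent, and that each uncompressed quality score occupies one byte (8 bits), so that BPQ = 8 × CR / 100. Neither definition is stated in this record, but the tabulated values are consistent with them (for example, 48.55% corresponds to 3.88 bits). The function names `bits_per_quality` and `compressed_bytes` are hypothetical.

```python
def bits_per_quality(cr_percent: float, raw_bits: int = 8) -> float:
    """Convert a compression ratio in percent to bits per quality score.

    Assumption: each uncompressed quality score is stored in `raw_bits` bits
    (8 for one ASCII character per score).
    """
    return raw_bits * cr_percent / 100.0


def compressed_bytes(quality_size_bytes: int, cr_percent: float) -> float:
    """Estimate the compressed size from the raw quality size and CR in percent.

    Assumption: CR(%) = compressed size / original size * 100.
    """
    return quality_size_bytes * cr_percent / 100.0


if __name__ == "__main__":
    # gzip on NA12878_2: CR = 48.55 %, quality size = 56983386200 bytes
    print(round(bits_per_quality(48.55), 2))
    print(compressed_bytes(56983386200, 48.55) / 1e9, "GB (estimated)")
```

Under these assumptions, a drop of one percentage point in CR corresponds to 0.08 bits saved per quality score.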