Jiarui Zhou, Zhen Ji, Zexuan Zhu, Shan He.
Abstract
BACKGROUND: The exponential growth of DNA data derived from next-generation sequencing (NGS) poses great challenges to data storage and transmission. Although many compression algorithms have been proposed for DNA reads in NGS data, few methods are designed specifically to handle the quality scores.
Year: 2014 PMID: 25474747 PMCID: PMC4271560 DOI: 10.1186/1471-2105-15-S15-S10
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Multimodal optimization.
Figure 2. Codebook based NGS quality score sequences compression.
Figure 3. Compression codebook design using SaNSDE-DSCG based multimodal optimization.
NGS data sets for MMQSC performance evaluation.
| Data | Species | Number of Reads | Number of Bases | File Size (MB) |
|---|---|---|---|---|
| SRR027474 | Marine metagenome | 28,109 | 3,580,544 | 9.2 |
| SRR396942 | Homo sapiens | 1,199,786 | 250,755,274 | 602 |
| SRR824063 | Caenorhabditis elegans | 711,156 | 142,231,200 | 348 |
| SRR824065 | Caenorhabditis elegans | 64,492 | 12,898,400 | 32 |
| SRR932018 | Clostridium symbiosum | 169,457 | 8,472,850 | 27 |
Parameters setting for SaNSDE-DSCG optimization.
| Parameter | Population Size | Dimension | Range | | FEs |
|---|---|---|---|---|---|
| (0, | | 0.1 × | 50 | 1E+4 |
Compression performance on experimental NGS data sets.
| Method | Metric | SRR027474 | SRR396942 | SRR824063 | SRR824065 | SRR932018 |
|---|---|---|---|---|---|---|
| RLE | CR (%) | 38.95 | 60.52 | 54.97 | 47.27 | 64.29 |
| RLE | BPQ | 3.11 | 4.84 | 4.40 | 3.78 | 5.14 |
| Huffman | CR (%) | 53.22 | 60.83 | 42.35 | 58.12 | 49.30 |
| Huffman | BPQ | 4.26 | 4.87 | 3.39 | 4.65 | 3.94 |
| gzip | CR (%) | 22.70 | 35.94 | 30.33 | 26.79 | 30.25 |
| gzip | BPQ | 1.82 | 2.88 | 2.43 | 2.14 | 2.42 |
| bzip2 | CR (%) | 16.23 | 31.12 | 25.84 | 25.07 | |
| bzip2 | BPQ | 1.30 | 2.49 | 2.07 | 2.01 | |
| LZMA | CR (%) | 17.63 | 31.32 | 25.63 | 23.08 | 25.00 |
| LZMA | BPQ | 1.41 | 2.51 | 2.05 | 1.85 | 2.00 |
| MMQSC | CR (%) | 27.38 | | | | |
| MMQSC | BPQ | 2.19 | | | | |
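
The record does not define CR or BPQ. The reported pairs are consistent with BPQ ≈ 8 × CR / 100 (for example, 60.52% and 4.84), which is what one would expect if CR is the compressed size as a percentage of the original size, BPQ is the number of compressed bits per quality score, and each original score occupies one 8-bit character. The following is a minimal sketch under those assumed definitions, using Python's standard-library compressors as stand-ins for the gzip, bzip2 and LZMA baselines. The quality string is synthetic and the helper name `compression_stats` is hypothetical; neither comes from the paper.

```python
import bz2
import gzip
import lzma
import random


def compression_stats(quality: bytes, compress) -> tuple[float, float]:
    """Return (CR in percent, BPQ) for one compressor.

    Assumed definitions (not stated in the source record):
      CR  = compressed size / original size * 100
      BPQ = compressed bits / number of quality scores
    """
    compressed = compress(quality)
    cr = 100.0 * len(compressed) / len(quality)
    bpq = 8.0 * len(compressed) / len(quality)
    return cr, bpq


# Synthetic Phred+33-style quality string, skewed towards high scores.
# This is illustrative data only, not one of the SRR data sets.
rng = random.Random(0)
quality = bytes(rng.choice(b"IIIIHHGF#") for _ in range(20000))

results = {
    name: compression_stats(quality, fn)
    for name, fn in [
        ("gzip", gzip.compress),
        ("bzip2", bz2.compress),
        ("LZMA", lzma.compress),
    ]
}

for name, (cr, bpq) in results.items():
    print(f"{name:6s} CR = {cr:6.2f} %   BPQ = {bpq:5.2f}")
```

Under these definitions a lower CR and a lower BPQ both indicate better compression, and the two metrics carry the same information at different scales.
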
Figure 4. Convergence trace of compression codebook optimization on experimental data sets.