Yongpeng Zhang, Linsen Li, Yanli Yang, Xiao Yang, Shan He, Zexuan Zhu.
Abstract
BACKGROUND: The exponential growth of next generation sequencing (NGS) data has posed major challenges to data storage, management and archiving. Data compression is one of the effective solutions, and reference-based compression strategies can typically achieve better compression ratios than strategies that do not rely on any reference.
Year: 2015 PMID: 26051252 PMCID: PMC4459677 DOI: 10.1186/s12859-015-0628-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The general framework of LW-FQZip
Examples of metadata encoding
| Original metadata | Incremental coding |
|---|---|
| SRR001471.1 E96DJWM01D47CS length = 110 | SRR001471.1 E96DJWM01D47CS |
| SRR001471.2 E96DJWM01CO1KR length = 297 | 9 CO1KR |
| SRR001471.3 E96DJWM01AL88Q length = 270 | 9 AL88Q |
| SRR001471.4 E96DJWM01ALL6A length = 274 | 11 L6A |
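
The incremental coding in the table above can be read as prefix-delta coding of the second header field: each entry stores the length of the prefix it shares with the previous name, followed by the differing suffix. The Python sketch below is written under that reading. The function names are hypothetical, and it assumes that the read index after the dot increases by one per record and that the `length = n` field is dropped because it can be recovered from the read itself. It is an illustration, not the LW-FQZip implementation.

```python
from typing import List


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def encode_headers(headers: List[str]) -> List[str]:
    """Prefix-delta encode headers of the assumed form '<run>.<index> <name> length = <n>'."""
    encoded = []
    prev_name = None
    for header in headers:
        fields = header.split()
        read_id, name = fields[0], fields[1]
        if prev_name is None:
            # The first record is stored verbatim (without the length field).
            encoded.append(f"{read_id} {name}")
        else:
            k = common_prefix_len(prev_name, name)
            encoded.append(f"{k} {name[k:]}")
        prev_name = name
    return encoded


def decode_headers(encoded: List[str]) -> List[str]:
    """Rebuild '<run>.<index> <name>' lines, assuming consecutive read indices."""
    first_id, name = encoded[0].split()
    run, index = first_id.rsplit(".", 1)
    decoded = [f"{run}.{index} {name}"]
    for offset, item in enumerate(encoded[1:], start=1):
        parts = item.split()
        k = int(parts[0])
        suffix = parts[1] if len(parts) > 1 else ""
        name = name[:k] + suffix
        decoded.append(f"{run}.{int(index) + offset} {name}")
    return decoded


headers = [
    "SRR001471.1 E96DJWM01D47CS length = 110",
    "SRR001471.2 E96DJWM01CO1KR length = 297",
    "SRR001471.3 E96DJWM01AL88Q length = 270",
    "SRR001471.4 E96DJWM01ALL6A length = 274",
]
print(encode_headers(headers))
```

Applied to the four headers in the table, this reproduces the entries of the "Incremental coding" column, and `decode_headers` restores the run identifier and read name of each record.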
Fig. 2 Flowchart of the light-weight mapping model
The descriptions of the output mapping result fields
| Field | Description |
|---|---|
| POS | The position on reference where the read is optimally aligned. |
| PAL | PAL=’0’ indicates the alignment of palindrome structure. |
| MLength | The number of matched or mismatched characters in the alignment. |
| MType | |
| MisValues | One or multiple bases in {‘A’, ‘C’, ‘G’, ‘T’, ‘N’} |
Fig. 3 Examples of short read encoding
Real-world FASTQ data sets used for performance evaluation
| Data | Species | Read Length | Number of Reads | Size (GB) | Reference |
|---|---|---|---|---|---|
| ERR231645 | | 51 | 6,344,039 | 1.41 | NC_000913 |
| ERR005143 | | 2*72 | 3,551,133 | 0.89 | NC_007005 |
| SRR352384 | | 2*76 | 26,030,832 | 9.88 | NC_001136.10 |
| SRR801793 | | 2*100 | 5,406,461 | 2.75 | NC_018140 |
| SRR554369 | | 2*200 | 1,657,871 | 0.82 | KI517354 |
| ERR654984 | | 64-502 | 1,167,295 | 1.21 | NC_000913 |
| ERR233152 | | 77 | 2,745,192 | 0.72 | AP014622 |
| SRR327342 | | 138 | 15,036,699 | 5.74 | ACFL01000033 |
Compression ratios of the compared methods on the eight FASTQ data sets
| Data | quip | quip -a | DSRC | DSRC2 | Fqzcomp | LW-FQZip | bzip2 |
|---|---|---|---|---|---|---|---|
| ERR231645 | 0.139 | | 0.164 | 0.160 | 0.136 | 0.127 | 0.208 |
| ERR005143 | 0.154 | 0.154 | 0.179 | 0.176 | 0.156 | | 0.211 |
| SRR352384 | 0.115 | 0.115 | 0.145 | 0.144 | 0.126 | | 0.183 |
| SRR801793 | 0.184 | 0.184 | 0.235 | 0.234 | 0.202 | | 0.268 |
| SRR554369 | 0.194 | 0.194 | 0.243 | 0.232 | 0.201 | | 0.262 |
| ERR654984 | 0.188 | 0.188 | 0.235 | 0.236 | 0.204 | | 0.262 |
| ERR233152 | 0.129 | 0.128 | 0.153 | 0.147 | 0.128 | | 0.177 |
| SRR327342 | | | 0.242 | 0.241 | 0.202 | 0.201 | 0.271 |
| Average | 0.151 | 0.150 | 0.190 | 0.189 | 0.162 | | 0.223 |
The compression ratios of LW-FQZip on the three components of FASTQ files
| Data | Metadata | Nucleotide sequence | Quality scores |
|---|---|---|---|
| ERR231645 | 0.027 | 0.091 | 0.421 |
| ERR005143 | 0.021 | 0.159 | 0.364 |
| SRR352384 | 0.024 | 0.151 | 0.130 |
| SRR801793 | 0.025 | 0.089 | 0.371 |
| SRR554369 | 0.029 | 0.117 | 0.346 |
| ERR654984 | 0.024 | 0.032 | 0.285 |
| ERR233152 | 0.015 | 0.190 | 0.238 |
| SRR327342 | 0.014 | 0.155 | 0.426 |
| Average | 0.021 | 0.134 | 0.268 |
The mapping result of the proposed light-weight mapping model against that of BWA
| Data | BWA: #Unmapped reads | BWA: #Mapped reads (exact) | BWA: #Mapped reads (inexact) | Light-weight mapping model: #Unmapped reads | Light-weight mapping model: #Mapped reads (exact) | Light-weight mapping model: #Mapped reads (inexact) | Na |
|---|---|---|---|---|---|---|---|
| ERR231645 | 133,518 | 6,047,074 | 163,447 | 606,902 | 5,460,749 | 276,388 | 17,866 |
| ERR005143 | 248,883 | 2 | 3,302,248 | 188,801 | 2 | 3,362,330 | 1,417,824 |
| SRR352384 | 22,182,981 | 2 | 3,847,849 | 22,437,492 | 1 | 3,593,339 | 960,640 |
| SRR801793 | 424,130 | 0 | 4,982,331 | 1,345,629 | 0 | 4,060,832 | 1,220,888 |
| SRR554369 | 1,114,495 | 0 | 543,376 | 319,275 | 0 | 1,338,596 | 596,529 |
| ERR654984 | 3,596 | 0 | 1,163,699 | 4,403 | 0 | 1,162,892 | 1,107,540 |
| ERR233152 | 1,005,747 | 6,603 | 1,732,842 | 524,812 | 7,375 | 2,213,005 | 69,929 |
| SRR327342 | 14,769,970 | 0 | 266,729 | 14,776,503 | 0 | 260,196 | 64,765 |
| Average | 64.39 % | 9.77 % | 25.84 % | 64.91 % | 8.83 % | 26.26 % | 33.54 % |
Na: the number of unmapped segments that can be realigned to the same reference genome
Mapping time and compression ratios using BWA and the light-weight mapping model
| Data | BWA: Mapping time (s) | BWA: Compression ratio | Light-weight mapping model: Mapping time (s) | Light-weight mapping model: Compression ratio |
|---|---|---|---|---|
| ERR231645 | 130.30 | 0.124 | 46.60 | 0.127 |
| ERR005143 | 533.72 | 0.150 | 56.21 | 0.151 |
| SRR352384 | 4036.98 | 0.111 | 289.02 | 0.111 |
| SRR801793 | 1907.43 | 0.174 | 79.19 | 0.176 |
| SRR554369 | 449.68 | 0.179 | 105.33 | 0.182 |
| ERR654984 | 282.84 | 0.146 | 69.37 | 0.140 |
| ERR233152 | 209.38 | 0.120 | 82.43 | 0.126 |
| SRR327342 | 2458.69 | 0.201 | 148.38 | 0.201 |
| Average | 1251.13 | 0.147 | 109.57 | 0.148 |
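
As a reading aid, the per-data-set speed-up of the light-weight mapping model over BWA can be computed directly from the mapping times in the table above. The short Python sketch below only reuses the tabulated values; the variable names are illustrative and not taken from the paper.

```python
# Mapping times in seconds, copied from the table: (BWA, light-weight model).
mapping_time_s = {
    "ERR231645": (130.30, 46.60),
    "ERR005143": (533.72, 56.21),
    "SRR352384": (4036.98, 289.02),
    "SRR801793": (1907.43, 79.19),
    "SRR554369": (449.68, 105.33),
    "ERR654984": (282.84, 69.37),
    "ERR233152": (209.38, 82.43),
    "SRR327342": (2458.69, 148.38),
}

# Speed-up = BWA time divided by light-weight mapping time.
speedups = {name: bwa / lw for name, (bwa, lw) in mapping_time_s.items()}

total_bwa = sum(bwa for bwa, _ in mapping_time_s.values())
total_lw = sum(lw for _, lw in mapping_time_s.values())
overall_speedup = total_bwa / total_lw

for name, s in speedups.items():
    print(f"{name}: {s:.1f}x")
print(f"overall (ratio of total times): {overall_speedup:.1f}x")
```

On these figures the ratio of total mapping times is about 11.4, with the largest gain on SRR801793 and the smallest on ERR233152, while the compression ratios of the two variants stay within a few thousandths of each other.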