Abstract
MOTIVATION: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed.
Year: 2015 PMID: 25649622 PMCID: PMC4481695 DOI: 10.1093/bioinformatics/btv071
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Compressed sizes (in bytes) using various methods
| Read set | Org. (a) | S/P (b) | 2-Bit | SCALCE | fastqz | CRAM | PathEnc | No. Trans. (c) |
|---|---|---|---|---|---|---|---|---|
| SRR037452 | | S | 102 487 744 | 66 630 706 | 80 465 928 | 156 554 323 | | 4.10 |
| SRR445718 | | S | 823 591 625 | 252 989 168 | 238 180 853 | 375 901 891 | | 0.98 |
| SRR490961 | | S | 1 228 191 700 | 300 176 711 | 316 478 709 | 518 183 711 | | 0.74 |
| SRR635193 | | P | 736 178 787 | 294 524 283 | 272 862 515 | 366 789 369 | | 0.90 |
| SRR1294122 | | S | 1 001 574 429 | 299 329 267 | 285 710 714 | 369 774 561 | | 0.86 |
| SRR689233 | | P | 738 357 525 | 233 812 737 | 266 126 542 | 929 644 204 | | 1.46 |
| SRR519063 | | P | 688 509 129 | 100 403 786 | 183 275 880 | 714 193 963 | | 6.12 |

(a) Organism (H.s., human; M.m., mouse; P.a., Pseudomonas aeruginosa).
(b) P indicates paired-end reads; S indicates single-end reads.
(c) Number of transmissions before size of the reference is recovered.
Fig. 1. Sizes of the various components of the compressed files. ‘Read tails’ are the portion of the reads encoded using arithmetic encoding. ‘Bit tree’ gives the storage used by the bit tree for encoding the read starts (the first k = 16 letters of each read). ‘Read head counts’ is the space taken to store the number of reads with each start. ‘N locations’ is the space to store the location of input Ns that were changed to As upon encoding. ‘Flipped bits’ gives the space needed to record (in a compressed format) a single bit for each read indicating whether the read was reverse complemented.
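The components named in the Fig. 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the arithmetic coder and the bit tree are omitted, and the per-read flip decision is taken as an input because the caption does not say how it is chosen. The sketch only shows how Ns could be replaced by As with their locations recorded, how a one-bit flip flag could accompany each read, and how reads could be split into a head (first k = 16 letters) and a tail, with a count kept per head.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def mask_ns(read):
    """Replace each N with A and return the masked read plus the N positions."""
    positions = [i for i, base in enumerate(read) if base == "N"]
    return read.replace("N", "A"), positions


def restore_ns(read, positions):
    """Put the Ns back at the recorded positions."""
    bases = list(read)
    for i in positions:
        bases[i] = "N"
    return "".join(bases)


def reverse_complement(read):
    """Complement every base, then reverse the string."""
    return read.translate(COMPLEMENT)[::-1]


def encode_reads(reads, flips, k=16):
    """Split reads into the components listed in the caption.

    flips[i] is True when read i is stored reverse complemented.
    How that choice is made is an assumption left to the caller.
    """
    head_counts = Counter()  # 'read head counts': reads per distinct start
    records = []
    for read, flip in zip(reads, flips):
        masked, n_positions = mask_ns(read)      # 'N locations'
        if flip:
            masked = reverse_complement(masked)  # 'flipped bits'
        head, tail = masked[:k], masked[k:]      # read start and 'read tail'
        head_counts[head] += 1
        records.append((head, tail, flip, n_positions))
    return head_counts, records


def decode_record(head, tail, flip, n_positions):
    """Invert encode_reads for a single record."""
    read = head + tail
    if flip:
        read = reverse_complement(read)
    return restore_ns(read, n_positions)
```

Masking happens before the optional reverse complement, so the recorded N positions always refer to the original orientation and decoding simply undoes the two steps in reverse order.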
Fig. 2. Performance when several features of the path encoding scheme are disabled. All values are given as a percentage over the size of the encoding that uses all the features. (A) ‘No reference’ starts with an empty transcriptome reference. ‘No reverse complement’ disables the reverse complementation of the reads. ‘No duplicate handling’ disables the recognition and special encoding of exact duplicate reads. (B) ‘No dynamic updates’ gives the compression when the probabilities of the statistical model are not updated as reads are encoded.
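To make the ‘No dynamic updates’ comparison concrete, the toy sketch below estimates the ideal code length (the sum of -log2 p over all bases) under a simple context model, with and without updating the counts as reads are processed. It is an illustration only: the function name, the context length and the uniform starting counts are assumptions, and the paper's actual model, which is seeded from a transcriptome reference, is not reproduced here.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def ideal_code_length(reads, k=3, update=True):
    """Return the ideal code length in bits for reads made of A, C, G, T.

    Each base is predicted from the up-to-k bases preceding it in the read.
    Counts start uniform (one per base), so with update=False every base
    costs exactly 2 bits. With update=True the count of each observed base
    is incremented after it is coded, as an adaptive coder would do.
    """
    counts = defaultdict(lambda: [1, 1, 1, 1])
    bits = 0.0
    for read in reads:
        for i, base in enumerate(read):
            context = read[max(0, i - k):i]
            row = counts[context]
            j = ALPHABET.index(base)
            bits += -math.log2(row[j] / sum(row))
            if update:
                row[j] += 1  # dynamic update of the model
    return bits


reads = ["ACGTACGTACGT"] * 4
static_bits = ideal_code_length(reads, update=False)
adaptive_bits = ideal_code_length(reads, update=True)
```

On repetitive input such as the four identical reads above, the adaptive total is smaller than the static one, because repeated context and base pairs become more probable each time they are seen.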
Fig. 3. Effect of kmer length. File size, represented as a fraction of the 2-bit encoding size, using various kmer lengths k.
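The 2-bit baseline used for normalization can be sketched as follows. The sketch assumes that ‘2-bit encoding’ means packing four bases into each byte, which this record does not spell out, and the function names are hypothetical. The worked ratio uses two sizes from the SRR037452 row of the table above.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_two_bit(seq):
    """Pack a string of A, C, G, T into bytes, four bases per byte."""
    out = bytearray()
    for start in range(0, len(seq), 4):
        chunk = seq[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a final partial chunk
        out.append(byte)
    return bytes(out)


def two_bit_size(total_bases):
    """Bytes needed to store total_bases bases at 2 bits each."""
    return (total_bases + 3) // 4


def fraction_of_two_bit(compressed_bytes, two_bit_bytes):
    """Express a compressed size as a fraction of the 2-bit size."""
    return compressed_bytes / two_bit_bytes


# SRR037452: 2-bit size and one of the compressed sizes from the same row.
ratio = fraction_of_two_bit(66_630_706, 102_487_744)  # roughly 0.65
```

A fraction below 1 means the method beats plain 2-bit packing; a fraction above 1 means the compressed file is larger than the packed bases alone.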