Minji Kim, Xiejia Zhang, Jonathan G Ligo, Farzad Farnoud, Venugopal V Veeravalli, Olgica Milenkovic.
Abstract
BACKGROUND: Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large file sizes, typically ranging between 1-10 GB. This leads to challenges in analyzing, transferring and storing metagenomic data. In order to overcome these data processing issues, we introduce MetaCRAM, the first de novo, parallelized software suite specialized for FASTA and FASTQ format metagenomic read processing and lossless compression.
Year: 2016 PMID: 26895947 PMCID: PMC4759986 DOI: 10.1186/s12859-016-0932-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Block Diagram. The block diagram of the MetaCRAM Algorithm for Metagenomic Data Processing and Compression. Its main components are taxonomy identification, alignment, assembly and compression
Analysis of the influence of different threshold values on reference genome selection after taxonomy identification and compression ratios
| Data | Original (MB) | **Comp. (MB), 75 species** | **Processing time, 75 species** | **Align. %, 75 species** | **No. files, 75 species** | Comp. (MB), 10 species | Processing time, 10 species | Align. %, 10 species | No. files, 10 species |
|---|---|---|---|---|---|---|---|---|---|
| ERR321482 | 1429 | 191 | 299 m 20 s | 26.99 | 211 | 193 | 239 m 28 s | 24.22 | 29 |
| | | | 422 m 21 s | 3.57 | 1480 | | 398 m 3 s | 6.5 | 1567 |
| | | | 12 m 24 s | | | | 8 m 13 s | | |
| SRR359032 | 3981 | 319 | 127 m 34 s | 57.72 | 26 | 320 | 93 m 60 s | 57.71 | 7 |
| | | | 245 m 53 s | 9.7 | 30 | | 206 m 18 s | 9.71 | 32 |
| | | | 8 m 37 s | | | | 7 m 27 s | | |
| ERR532393 | 8230 | 948 | 639 m 55 s | 45.78 | 267 | 963 | 522 m | 42.45 | 39 |
| | | | 1061 m 50 s | 1.98 | 1456 | | 1067 m 49 s | 7.16 | 1639 |
| | | | 73 m 59 s | | | | 28 m 13 s | | |
| SRR1450398 | 5399 | 703 | 440 m 4 s | 7.14 | 190 | 703 | 364 m 34 s | 6.82 | 26 |
| | | | 866 m 56 s | 0.6 | 793 | | 790 m 52 s | 0.91 | 818 |
| | | | 21 m 2 s | | | | 17 m 38 s | | |
| SRR062462 | 6478 | 137 | 217 m 21 s | 2.55 | 278 | 139 | 197 m 15 s | 2.13 | 50 |
| | | | 254 m 26 s | 0.13 | 570 | | 241 m 2 s | 0.51 | 656 |
| | | | 15 m 45 s | | | | 19 m 31 s | | |
Columns in bold represent a threshold of 75 species, while the columns not bolded correspond to a cutoff of 10 species. The results are shown for MetaCRAM-Huffman. “Align. %” refers to the alignment rates for the first and second round, and “No. files” refers to the number of reference genome files selected in the first and second iteration. Processing times are recorded row by row denoting real, user, and system time in order
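The processing times above are reported as real, user, and system time, the same three quantities printed by tools such as the Unix `time` command. As a rough illustration of how to read them, the sketch below parses entries such as `299 m 20 s` into seconds and computes (user + system) / real, a ratio that exceeds 1 when work is spread over more than one core. The helper names `to_seconds` and `cpu_to_wall_ratio` are hypothetical and are not part of MetaCRAM.

```python
import re


def to_seconds(text: str) -> int:
    """Convert a table entry such as '299 m 20 s' or '522 m' to seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*m(?:\s*(\d+)\s*s)?\s*", text)
    if match is None:
        raise ValueError(f"unrecognized time entry: {text!r}")
    minutes = int(match.group(1))
    seconds = int(match.group(2) or 0)  # the seconds part may be absent
    return 60 * minutes + seconds


def cpu_to_wall_ratio(real: str, user: str, system: str) -> float:
    """Return (user + system) / real for three time entries (hypothetical helper)."""
    return (to_seconds(user) + to_seconds(system)) / to_seconds(real)


# ERR321482 with the 75-species threshold (values taken from the table above).
ratio = cpu_to_wall_ratio("299 m 20 s", "422 m 21 s", "12 m 24 s")
print(f"CPU time / wall-clock time: {ratio:.2f}")
```

A ratio well above 1 is consistent with the parallelized design described in the abstract, although the exact number of cores used cannot be read off from this ratio alone.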
Comparison of compression ratios of six software suites
| Data | Original (MB) | MCH1 (MB) | MCH2 (MB) | MCG (MB) | MCEG (MB) | Align. % | Qual value (MB) | bzip2 (MB) | gzip (MB) | MFComp (MB) |
|---|---|---|---|---|---|---|---|---|---|---|
| ERR321482 | 1429 | **191** | 186 | 312 | 213 | 29.6 | 411 | 362 | 408 | 229 |
| SRR359032 | 3981 | 319 | 282 | 657 | 458 | 61.8 | 2183 | 998 | 1133 | **263** |
| ERR532393 | 8230 | **948** | 898 | 1503 | 1145 | 46.8 | 3410 | 2083 | 2366 | 1126 |
| SRR1450398 | 5399 | **703** | 697 | 854 | 729 | 7.7 | 365 | 1345 | 1532 | 726 |
| SRR062462 | 6478 | **137** | 135 | 188 | 144 | 2.7 | 153 | 222 | 356 | 161 |
As shorthand, we use “MCH” = MetaCRAM-Huffman, “MCG” = MetaCRAM-Golomb, “MCEG” = MetaCRAM-extended Golomb, and “MFComp” = MFCompress. MCH1 is the default option of MetaCRAM with Huffman encoding, and MCH2 is a version of MetaCRAM in which we removed the redundancy in both quality scores and the read IDs. “Align. %” refers to the total alignment rates from the first and second iteration. The minimum compressed file sizes achieved by the methods are shown in bold
Fig. 2 Compression ratio. The compression ratios for all six software suites
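The record does not state how the compression ratio is defined. Assuming the common convention of original size divided by compressed size, the ratios can be recomputed from the reported file sizes. The sketch below does this for the ERR321482 dataset; the function name `compression_ratio` and the choice of convention are assumptions, so the numbers may not match the figure if the authors used a different definition.

```python
def compression_ratio(original_mb: float, compressed_mb: float) -> float:
    """Original size divided by compressed size (assumed definition)."""
    if compressed_mb <= 0:
        raise ValueError("compressed size must be positive")
    return original_mb / compressed_mb


# ERR321482 file sizes in MB, as reported in the tables of this record.
original_mb = 1429
compressed_mb = {
    "MetaCRAM-Huffman": 191,
    "bzip2": 362,
    "gzip": 408,
    "MFCompress": 229,
}

ratios = {
    name: compression_ratio(original_mb, size)
    for name, size in compressed_mb.items()
}

# Print the methods from highest to lowest ratio.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:.2f}")
```

Under this convention a larger ratio means a smaller output file, so the ordering of methods is the reverse of the ordering by compressed size.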
Comparison of processing (compression) times of six software suites. Times are recorded row by row denoting real, user, and system time in order
| Data | Time | MCH | MCG | MCEG | bzip2 | gzip | MFComp |
|---|---|---|---|---|---|---|---|
| ERR321482 | real | 299 m 20 s | 294 m 27 s | 274 m 43 s | 2 m 2 s | 3 m 49 s | 2 m 38 s |
| | user | 422 m 21 s | 422 m 49 s | 402 m 25 s | 1 m 56 s | 3 m 45 s | 4 m 49 s |
| | sys | 12 m 24 s | 8 m 48 s | 12 m 13 s | 0 m 1 s | 0 m 1 s | 0 m 13 s |
| SRR359032 | real | 127 m 34 s | 129 m 32 s | 128 m 14 s | 5 m 36 s | 10 m 39 s | 8 m 2 s |
| | user | 245 m 53 s | 247 m 43 s | 253 m 16 s | 5 m 19 s | 10 m 30 s | 13 m 3 s |
| | sys | 8 m 37 s | 10 m 1 s | 15 m 25 s | 0 m 2 s | 0 m 2 s | 0 m 15 s |
| ERR532393 | real | 639 m 55 s | 635 m 53 s | 641 m 32 s | 11 m 28 s | 22 m 18 s | 17 m 2 s |
| | user | 1061 m 50 s | 1069 m 9 s | 1090 m 20 s | 11 m 4 s | 21 m 58 s | 28 m 29 s |
| | sys | 73 m 59 s | 27 m 59 s | 43 m 35 s | 0 m 5 s | 0 m 5 s | 0 m 21 s |
| SRR1450398 | real | 440 m 4 s | 439 m 42 s | 440 m 36 s | 7 m 38 s | 14 m 39 s | 10 m 32 s |
| | user | 866 m 56 s | 865 m 38 s | 865 m 6 s | 7 m 19 s | 14 m 24 s | 18 m 8 s |
| | sys | 21 m 2 s | 23 m 51 s | 26 m 5 s | 0 m 3 s | 0 m 3 s | 0 m 18 s |
| SRR062462 | real | 217 m 21 s | 224 m 32 s | 215 m 58 s | 2 m 48 s | 2 m 6 s | 6 m 38 s |
| | user | 254 m 26 s | 261 m 19 s | 256 m 17 s | 2 m 7 s | 1 m 18 s | 10 m 39 s |
| | sys | 15 m 45 s | 16 m 48 s | 20 m 14 s | 0 m 3 s | 0 m 3 s | 0 m 16 s |
Fig. 3 Average Runtime of Each Stage of MetaCRAM. Detailed distribution of the average runtimes of MetaCRAM for all five datasets tested. We used “_1” to indicate the processes executed in the first round, and “_2” to denote the processes executed in the second round
Comparison of compressed file sizes of MetaCRAM-Huffman using 2 rounds and 1 round
| Data | Original (MB) | MCH-2rounds (MB) | Align. % | MCH-1round (MB) | Align. % | gzip (MB) | MFComp (MB) |
|---|---|---|---|---|---|---|---|
| ERR321482 | 1429 | 191 | 29.6 | 192 | 27 | 408 | 229 |
| SRR359032 | 3981 | 319 | 61.8 | 315 | 57.7 | 1133 | 263 |
| ERR532393 | 8230 | 948 | 46.8 | 952 | 45.8 | 2366 | 1126 |
| SRR1450398 | 5399 | 703 | 7.7 | 707 | 7.1 | 1532 | 726 |
| SRR062462 | 6478 | 137 | 2.7 | 143 | 2.6 | 356 | 161 |
As shorthand, we use “MCH-2rounds” = MetaCRAM-Huffman with 2 rounds and “MCH-1round” = MetaCRAM-Huffman with 1 round. “MFComp” = MFCompress, and “Align. %” refers to the percentage of reads aligned during 2 rounds and 1 round for MCH-2rounds and MCH-1round, respectively
Comparison of retrieval (decompression) times of six software suites. Times are recorded row by row denoting real, user, and system time in order
| Data | Time | MCH | MCG | MCEG | bzip2 | gzip | MFComp |
|---|---|---|---|---|---|---|---|
| ERR321482 | real | 23 m 17 s | 25 m 18 s | 24 m 56 s | 0 m 57 s | 0 m 17 s | 2 m 26 s |
| | user | 16 m 17 s | 16 m 30 s | 17 m 7 s | 0 m 45 s | 0 m 9 s | 4 m 42 s |
| | sys | 9 m 2 s | 10 m 42 s | 10 m 25 s | 0 m 2 s | 0 m 1 s | 0 m 4 s |
| SRR359032 | real | 12 m 16 s | 11 m 43 s | 13 m 17 s | 2 m 37 s | 1 m 28 s | 7 m 58 s |
| | user | 11 m 59 s | 11 m 24 s | 12 m 43 s | 2 m 8 s | 0 m 28 s | 15 m 10 s |
| | sys | 2 m 24 s | 1 m 42 s | 3 m 12 s | 0 m 4 s | 0 m 2 s | 0 m 19 s |
| ERR532393 | real | 48 m 19 s | 47 m 5 s | 55 m 58 s | 5 m 25 s | 2 m 30 s | 15 m 29 s |
| | user | 39 m 59 s | 40 m 5 s | 43 m 21 s | 4 m 23 s | 0 m 55 s | 29 m 23 s |
| | sys | 15 m 39 s | 13 m 25 s | 29 m 17 s | 0 m 7 s | 0 m 5 s | 0 m 17 s |
| SRR1450398 | real | 28 m 43 s | 27 m 54 s | 29 m 27 s | 3 m 25 s | 1 m 54 s | 10 m 8 s |
| | user | 29 m 55 s | 29 m 47 s | 30 m 45 s | 2 m 52 s | 0 m 37 s | 19 m 1 s |
| | sys | 7 m 10 s | 5 m 52 s | 7 m 4 s | 0 m 5 s | 0 m 3 s | 0 m 26 s |
| SRR062462 | real | 23 m 9 s | 22 m 55 s | 26 m 6 s | 1 m 3 s | 1 m 19 s | 5 m 52 s |
| | user | 21 m 10 s | 21 m 10 s | 21 m 58 s | 0 m 42 s | 0 m 22 s | 10 m 31 s |
| | sys | 4 m 49 s | 4 m 53 s | 10 m 12 s | 0 m 4 s | 0 m 3 s | 0 m 26 s |
Fig. 4 Integer Distribution. Distribution fitting of integers to be encoded, truncated at 90 % of the integer data
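Two of the MetaCRAM variants compared in this record use Golomb-type codes for the integers to be encoded. For readers unfamiliar with the technique, the following is a minimal sketch of textbook Golomb coding for non-negative integers: a unary quotient followed by a truncated-binary remainder. It is a generic illustration and not the MetaCRAM implementation; the function names and the parameter value `m = 4` in the example are assumptions.

```python
def golomb_encode(n: int, m: int) -> str:
    """Return the Golomb code of n >= 0 with parameter m >= 1 as a bit string."""
    if n < 0 or m < 1:
        raise ValueError("need n >= 0 and m >= 1")
    q, r = divmod(n, m)
    bits = "1" * q + "0"            # quotient in unary, terminated by a 0
    b = (m - 1).bit_length()        # ceil(log2(m)); equals 0 when m == 1
    if b == 0:
        return bits                 # m == 1: the remainder is always 0
    cutoff = (1 << b) - m           # remainders below cutoff use b - 1 bits
    if r < cutoff:
        return bits + format(r, f"0{b - 1}b")
    return bits + format(r + cutoff, f"0{b}b")


def golomb_decode(bits: str, m: int) -> int:
    """Decode a single Golomb codeword produced by golomb_encode."""
    pos = 0
    while bits[pos] == "1":         # count the unary quotient
        pos += 1
    q = pos
    pos += 1                        # skip the terminating 0
    b = (m - 1).bit_length()
    if b == 0:
        return q
    cutoff = (1 << b) - m
    r = int(bits[pos:pos + b - 1], 2) if b > 1 else 0
    pos += b - 1
    if r >= cutoff:                 # long codeword: read one more bit
        r = ((r << 1) | int(bits[pos])) - cutoff
    return q * m + r


codeword = golomb_encode(9, 4)      # quotient 2, remainder 1
print(codeword, golomb_decode(codeword, 4))
```

Golomb codes assign short codewords to small integers, which suits roughly geometric distributions; whether that matches the fitted distribution in the figure is a question the record's data would have to answer.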
Processing time improvements for two rounds of MetaCRAM on the SRR359032 dataset (5.4 GB, without removing redundancy in description lines) resulting from parallelization of assembly and compression
| Time | Without parallelization | With parallelization | Reduction (%) |
|---|---|---|---|
| Real | 235 m 40 s | 170 m 4 s | 27.7 |
| User | 449 m 40 s | 346 m 33 s | 22.9 |
| System | 14 m 13 s | 8 m 45 s | 40.1 |