Axel Wedemeyer, Lasse Kliemann, Anand Srivastav, Christian Schielke, Thorsten B Reusch, Philip Rosenstiel.
Abstract
BACKGROUND: For single-cell or metagenomic sequencing projects, it is necessary to sequence with a very high mean coverage in order to make sure that all parts of the sample DNA get covered by the reads produced. This leads to huge datasets with lots of redundant data. A filtering of this data prior to assembly is advisable. Brown et al. (2012) presented the algorithm Diginorm for this purpose, which filters reads based on the abundance of their k-mers.
Keywords: Bignorm; Coverage; Diginorm; Read filtering; Read normalization; Single cell sequencing
Year: 2017 PMID: 28673253 PMCID: PMC5496428 DOI: 10.1186/s12859-017-1724-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
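The abstract says only that Diginorm filters reads based on the abundance of their k-mers. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the commonly described rule of keeping a read while the median count of its k-mers is below a cutoff. The function names, the exact `Counter` (a stand-in for a memory-efficient probabilistic counter), and the default values of `k` and `cutoff` are illustrative assumptions.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def normalize_by_median(reads, k=20, cutoff=20):
    """Keep a read only while the median count of its k-mers is below `cutoff`.

    Assumption: counts are exact and updated only for accepted reads.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: nothing to evaluate
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Ten copies of one read plus one distinct read: redundant copies are dropped
# once their k-mers are abundant enough, the distinct read is kept.
example_reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
example_kept = normalize_by_median(example_reads, k=4, cutoff=3)
```

Because counts grow only when a read is accepted, highly covered regions stop contributing reads early, while reads from rarely covered regions keep passing the filter.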
Coverage statistics for Bignorm with Q0 = 20, Diginorm, and the raw datasets
| Dataset | Algorithm | | Mean | | Max |
|---|---|---|---|---|---|
| Aceto | Bignorm | 6 | 132 | 216 | 6801 |
| | Diginorm | 7 | 171 | 295 | 12,020 |
| | Raw | 15 | 9562 | 17,227 | 551,000 |
| Alphaproteo | Bignorm | 10 | 43 | 92 | 884 |
| | Diginorm | 7 | 173 | 481 | 6681 |
| | Raw | 25 | 5302 | 14,070 | 303,200 |
| Arco | Bignorm | 1 | 98 | 54 | 2103 |
| | Diginorm | 1 | 362 | 200 | 6114 |
| | Raw | 3 | 10,850 | 4091 | 220,600 |
| Arma | Bignorm | 8 | 23 | 32 | 358 |
| | Diginorm | 8 | 79 | 141 | 5000 |
| | Raw | 17 | 629 | 1118 | 31,260 |
| ASZN2 | Bignorm | 40 | 70 | 83 | 2012 |
| | Diginorm | 23 | 143 | 354 | 3437 |
| | Raw | 50 | 1738 | 4784 | 43,840 |
| Bacteroides | Bignorm | 3 | 74 | 90 | 6768 |
| | Diginorm | 3 | 123 | 205 | 7933 |
| | Raw | 7 | 6051 | 8127 | 570,900 |
| Caldi | Bignorm | 25 | 63 | 110 | 786 |
| | Diginorm | 15 | 67 | 135 | 3584 |
| | Raw | 27 | 1556 | 3643 | 33,530 |
| Caulo | Bignorm | 7 | 228 | 216 | 10,400 |
| | Diginorm | 8 | 362 | 491 | 35,520 |
| | Raw | 8 | 10,220 | 9737 | 464,300 |
| Chloroflexi | Bignorm | 8 | 72 | 101 | 2822 |
| | Diginorm | 9 | 412 | 878 | 20,850 |
| | Raw | 9 | 5612 | 7741 | 316,900 |
| Crenarch | Bignorm | 8 | 104 | 159 | 3770 |
| | Diginorm | 10 | 560 | 1285 | 29,720 |
| | Raw | 10 | 8086 | 14,987 | 316,700 |
| Cyanobact | Bignorm | 9 | 144 | 153 | 5234 |
| | Diginorm | 10 | 756 | 1450 | 26,980 |
| | Raw | 10 | 9478 | 11,076 | 356,600 |
| E. coli | Bignorm | 37 | 45 | 56 | 234 |
| | Diginorm | 50 | 382 | 922 | 7864 |
| | Raw | 112 | 2522 | 6378 | 56,520 |
| SAR324 | Bignorm | 24 | 49 | 71 | 1410 |
| | Diginorm | 18 | 53 | 107 | 2473 |
| | Raw | 26 | 1086 | 2761 | 106,000 |
Selected species and datasets (Cases)
| Short name | Species/Description | Source |
|---|---|---|
| ASZN2 | Candidatus Poribacteria sp. WGA-4E_FD | Hentschel Group |
| Aceto | Acetothermia bacterium JGI MDM2 LHC4sed-1-H19 | JGI Genome Portal |
| Alphaproteo | Alphaproteobacteria bacterium SCGC AC-312_D23v2 | JGI Genome Portal |
| Arco | Arcobacter sp. SCGC AAA036-D18 | JGI Genome Portal |
| Arma | Armatimonadetes bacterium JGI 0000077-K19 | JGI Genome Portal |
| Bacteroides | Bacteroidetes bacVI JGI MCM14ME016 | JGI Genome Portal |
| Caldi | Calescamantes bacterium JGI MDM2 SSWTFF-3-M19 | JGI Genome Portal |
| Caulo | Caulobacter bacterium JGI SC39-H11 | JGI Genome Portal |
| Chloroflexi | Chloroflexi bacterium SCGC AAA257-O03 | JGI Genome Portal |
| Crenarch | Crenarchaeota archaeon SCGC AAA261-F05 | JGI Genome Portal |
| Cyanobact | Cyanobacteria bacterium SCGC JGI 014-E08 | JGI Genome Portal |
| E. coli | E. coli K-12, strain MG1655, single cell MDA, Cell one | UC San Diego |
| SAR324 | SAR324 (Deltaproteobacteria) | UC San Diego |
Fig. 1. Box plots showing reduction and quality statistics. (a) Percentage of accepted reads (i.e. reads kept) over all datasets. (b) Mean quality values of the accepted reads over all datasets
Comparing quality values for the raw dataset and Bignorm with Q0 = 20
| Quartile | Bignorm | Raw |
|---|---|---|
| Max | 37.82 | 37.37 |
| 3rd quartile | 37.33 | 36.52 |
| Median | 33.77 | 32.52 |
| 1st quartile | 31.91 | 30.50 |
| Min | 26.14 | 24.34 |
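The five values per column in this table coincide with the maximum, quartiles, and minimum of the per-dataset mean phred scores listed in Part I of the filter and assembly statistics. The sketch below recomputes them. The choice of the "inclusive" quantile method and the function name `five_number_summary` are assumptions made for illustration, since the source does not say how the quartiles were obtained.

```python
from statistics import quantiles

# Per-dataset mean phred scores, copied from Part I of the
# filter and assembly statistics (13 datasets each).
bignorm = [37.33, 34.65, 33.77, 28.21, 37.66, 37.47, 37.82,
           36.95, 31.91, 33.18, 30.45, 26.14, 33.05]
raw = [36.52, 33.64, 32.27, 26.96, 36.85, 37.25, 37.37,
       36.01, 30.50, 31.49, 28.49, 24.34, 32.52]


def five_number_summary(values):
    """Return (max, Q3, median, Q1, min), from largest to smallest."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (max(values), q3, q2, q1, min(values))


bignorm_summary = five_number_summary(bignorm)
raw_summary = five_number_summary(raw)
```

With 13 values, the inclusive method lands exactly on the 4th, 7th, and 10th sorted values, so no interpolation is involved.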
Fig. 2. Box plots showing coverage statistics. (a) Mean coverage over all datasets. (b) 10th percentile of the coverage over all datasets
Fig. 3. Assembly statistics for four selected datasets; measurements of assemblies performed on the datasets with prior filtering using Diginorm and Bignorm, relative to the results of assemblies performed on the unfiltered datasets
Filter and assembly statistics for Bignorm with Q0 = 20, Diginorm, and the raw datasets (Part I)
| Dataset | Algorithm | Reads kept (%) | Mean phred score | Contigs ≥10 000 | Filter time (s) | SPAdes time (s) |
|---|---|---|---|---|---|---|
| Aceto | Bignorm | 3.16 | 37.33 | 1 | 906 | 1708 |
| | Diginorm | 3.95 | 27.28 | 1 | 3290 | 4363 |
| | Raw | | 36.52 | 3 | | 47,813 |
| Alphaproteo | Bignorm | 3.13 | 34.65 | 18 | 623 | 420 |
| | Diginorm | 7.81 | 28.73 | 17 | 1629 | 11,844 |
| | Raw | | 33.64 | 17 | | 29,057 |
| Arco | Bignorm | 2.20 | 33.77 | 4 | 429 | 207 |
| | Diginorm | 8.76 | 21.39 | 6 | 1410 | 1385 |
| | Raw | | 32.27 | 6 | | 15,776 |
| Arma | Bignorm | 7.90 | 28.21 | 44 | 240 | 135 |
| | Diginorm | 29.30 | 21.19 | 50 | 588 | 1743 |
| | Raw | | 26.96 | 44 | | 5371 |
| ASZN2 | Bignorm | 5.66 | 37.66 | 118 | 1224 | 1537 |
| | Diginorm | 12.62 | 32.73 | 130 | 5125 | 21,626 |
| | Raw | | 36.85 | 112 | | 47,859 |
| Bacteroides | Bignorm | 2.85 | 37.47 | 6 | 653 | 3217 |
| | Diginorm | 4.94 | 27.64 | 5 | 2124 | 3668 |
| | Raw | | 37.25 | 9 | | 32,409 |
| Caldi | Bignorm | 3.97 | 37.82 | 41 | 842 | 455 |
| | Diginorm | 5.61 | 30.67 | 36 | 1838 | 793 |
| | Raw | | 37.37 | 38 | | 7563 |
| Caulo | Bignorm | 2.40 | 36.95 | 10 | 679 | 712 |
| | Diginorm | 4.70 | 25.16 | 9 | 2584 | 765 |
| | Raw | | 36.01 | 13 | | 18,497 |
| Chloroflexi | Bignorm | 1.40 | 31.91 | 32 | 694 | 134 |
| | Diginorm | 9.70 | 18.91 | 33 | 2304 | 1852 |
| | Raw | | 30.50 | 34 | | 15,108 |
| Crenarch | Bignorm | 1.46 | 33.18 | 19 | 1107 | 790 |
| | Diginorm | 9.72 | 19.80 | 18 | 2931 | 3754 |
| | Raw | | 31.49 | 26 | | 20,590 |
| Cyanobact | Bignorm | 1.65 | 30.45 | 12 | 679 | 450 |
| | Diginorm | 11.30 | 17.58 | 13 | 1487 | 1343 |
| | Raw | | 28.49 | 13 | | 9417 |
| E. coli | Bignorm | 1.91 | 26.14 | 67 | 2279 | 598 |
| | Diginorm | 17.03 | 19.34 | 63 | 9105 | 3995 |
| | Raw | | 24.34 | 64 | | 16,706 |
| SAR324 | Bignorm | 4.34 | 33.05 | 55 | 1222 | 708 |
| | Diginorm | 4.69 | 23.58 | 52 | 3706 | 3085 |
| | Raw | | 32.52 | 51 | | 26,237 |
Filter and assembly statistics for Bignorm with Q0 = 20, Diginorm, and the raw datasets (Part II)
| Dataset | Algorithm | N50 (abs) | N50 (% of raw) | N50 (% of Diginorm) | Longest contig length (abs) | Longest contig length (% of raw) | Longest contig length (% of Diginorm) | Genomic fraction (abs) | Genomic fraction (% of raw) | Genomic fraction (% of Diginorm) | Misassembled contig length (abs) | Misassembled contig length (% of raw) | Misassembled contig length (% of Diginorm) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aceto | Bignorm | 2324 | 79 | 105 | 11,525 | 98 | 100 | 91 | 97 | 97 | 52,487 | 148 | 178 |
| | Diginorm | 2216 | 76 | | 11,525 | 98 | | 94 | 100 | | 29,539 | 84 | |
| | Raw | 2935 | | | 11,772 | | | 94 | | | 35,351 | | |
| Alphaproteo | Bignorm | 11,750 | 94 | 115 | 43,977 | 91 | 95 | 98 | 101 | 105 | 52,001 | 120 | 89 |
| | Diginorm | 10,213 | 82 | | 46,295 | 95 | | 93 | 95 | | 58,184 | 134 | |
| | Raw | 12,446 | | | 48,586 | | | 98 | | | 43,388 | | |
| Arco | Bignorm | 3320 | 81 | 97 | 12,808 | 57 | 57 | 85 | 100 | 97 | 76,797 | 99 | 91 |
| | Diginorm | 3434 | 84 | | 22,463 | 100 | | 88 | 103 | | 84,613 | 109 | |
| | Raw | 4092 | | | 22,439 | | | 85 | | | 77,888 | | |
| Arma | Bignorm | 18,432 | 102 | 107 | 108,140 | 100 | 100 | 98 | 100 | 100 | 774,291 | 91 | 103 |
| | Diginorm | 17,288 | 96 | | 108,498 | 100 | | 98 | 100 | | 748,560 | 88 | |
| | Raw | 18,039 | | | 108,498 | | | 98 | | | 849,085 | | |
| ASZN2 | Bignorm | 19,788 | 91 | 88 | 72,685 | 71 | 88 | 97 | 99 | 99 | 2,753,167 | 94 | 105 |
| | Diginorm | 16,591 | 76 | | 82,687 | 81 | | 97 | 100 | | 2,617,095 | 89 | |
| | Raw | 21,784 | | | 102,287 | | | 97 | | | 2,941,524 | | |
| Bacteroides | Bignorm | 3356 | 68 | 100 | 25,300 | 100 | 100 | 95 | 98 | 99 | 70,206 | 105 | 112 |
| | Diginorm | 3356 | 68 | | 25,300 | 100 | | 96 | 99 | | 62,882 | 94 | |
| | Raw | 4930 | | | 25,299 | | | 98 | | | 66,626 | | |
| Caldi | Bignorm | 50,973 | 82 | 83 | 143,346 | 89 | 91 | 100 | 100 | 100 | 573,836 | 94 | 68 |
| | Diginorm | 61,108 | 98 | | 157,479 | 98 | | 100 | 100 | | 839,126 | 138 | |
| | Raw | 62,429 | | | 160,851 | | | 100 | | | 609,604 | | |
| Caulo | Bignorm | 4515 | 69 | 95 | 20,255 | 100 | 107 | 96 | 98 | 98 | 60,362 | 86 | 113 |
| | Diginorm | 4729 | 72 | | 18,907 | 93 | | 98 | 101 | | 53,456 | 76 | |
| | Raw | 6562 | | | 20,255 | | | 97 | | | 70,161 | | |
| Chloroflexi | Bignorm | 13,418 | 102 | 109 | 79,605 | 102 | 102 | 99 | 100 | 100 | 666,519 | 95 | 93 |
| | Diginorm | 12,305 | 93 | | 78,276 | 100 | | 100 | 100 | | 716,473 | 102 | |
| | Raw | 13,218 | | | 78,276 | | | 99 | | | 703,171 | | |
| Crenarch | Bignorm | 6538 | 77 | 91 | 31,401 | 81 | 66 | 97 | 99 | 99 | 484,354 | 89 | 95 |
| | Diginorm | 7148 | 84 | | 47,803 | 124 | | 98 | 100 | | 510,256 | 94 | |
| | Raw | 8501 | | | 38,582 | | | 98 | | | 544,763 | | |
| Cyanobact | Bignorm | 5833 | 95 | 99 | 33,462 | 98 | 100 | 99 | 101 | 100 | 236,391 | 113 | 110 |
| | Diginorm | 5907 | 96 | | 33,516 | 98 | | 99 | 101 | | 214,574 | 103 | |
| | Raw | 6130 | | | 34,300 | | | 98 | | | 209,269 | | |
| E. coli | Bignorm | 112,393 | 100 | 100 | 268,306 | 94 | 94 | 96 | 100 | 100 | 28,966 | 65 | 65 |
| | Diginorm | 112,393 | 100 | | 285,311 | 100 | | 96 | 100 | | 44,465 | 100 | |
| | Raw | 112,393 | | | 285,528 | | | 96 | | | 44,366 | | |
| SAR324 | Bignorm | 135,669 | 100 | 114 | 302,443 | 100 | 100 | 99 | 100 | 100 | 4,259,479 | 98 | 100 |
| | Diginorm | 119,529 | 88 | | 302,443 | 100 | | 99 | 100 | | 4,264,234 | 98 | |
| | Raw | 136,176 | | | 302,442 | | | 99 | | | 4,342,602 | | |
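For readers unfamiliar with the N50 column and the relative columns of the Part II table, the sketch below shows the usual definition of N50 and how a "% of raw" or "% of Diginorm" entry can be derived from two absolute values. The source does not state how the authors computed these figures, so the function names `n50` and `percent_of` and the rounding to whole percent are assumptions for illustration.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


def percent_of(value, baseline):
    """Express `value` as a whole-number percentage of `baseline`."""
    return round(100 * value / baseline)


# Toy contig set (hypothetical lengths, not from the paper).
toy_n50 = n50([80, 70, 50, 40, 30, 20])

# Aceto N50 values from the Part II table: Bignorm 2324, Diginorm 2216, raw 2935.
aceto_vs_raw = percent_of(2324, 2935)
aceto_vs_diginorm = percent_of(2324, 2216)
```

For the Aceto row this reproduces the tabulated 79 (% of raw) and 105 (% of Diginorm).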
Reference length and total length of assemblies for Bignorm with Q0 = 20, Diginorm, and the raw datasets
| Dataset | Reference length | Raw total length | Raw % of ref | Diginorm total length | Diginorm % of ref | Bignorm total length | Bignorm % of ref |
|---|---|---|---|---|---|---|---|
| Aceto | 426,710 | 750,316 | 175.80 | 769,090 | 180.20 | 731,850 | 171.50 |
| Alphaproteo | 463,456 | 405,020 | 87.40 | 377,293 | 81.40 | 394,979 | 85.20 |
| Arco | 231,937 | 408,571 | 176.20 | 419,403 | 180.80 | 380,191 | 163.90 |
| Arma | 1,364,272 | 2,123,588 | 155.70 | 2,131,958 | 156.30 | 2,077,037 | 152.20 |
| ASZN2 | 3,669,182 | 4,938,079 | 134.60 | 4,930,677 | 134.40 | 4,836,216 | 131.80 |
| Bacteroides | 560,676 | 826,566 | 147.40 | 818,799 | 146.00 | 792,384 | 141.30 |
| Caldi | 1,961,164 | 2,044,270 | 104.20 | 2,041,841 | 104.10 | 2,037,901 | 103.90 |
| Caulo | 423,390 | 601,709 | 142.10 | 616,942 | 145.70 | 590,319 | 139.40 |
| Chloroflexi | 863,677 | 1,317,768 | 152.60 | 1,326,848 | 153.60 | 1,186,531 | 137.40 |
| Crenarch | 716,004 | 1,009,122 | 140.90 | 1,016,485 | 142.00 | 946,606 | 132.20 |
| Cyanobact | 343,353 | 635,368 | 185.00 | 636,876 | 185.50 | 591,367 | 172.20 |
| E. coli | 4,639,675 | 4,896,992 | 105.50 | 4,898,422 | 105.60 | 4,948,739 | 106.70 |
| SAR324 | 4,255,983 | 4,676,938 | 109.90 | 4,674,540 | 109.80 | 4,669,774 | 109.70 |
Quartiles for comparison of mean phred score, filter and assembler wall time in %
| Ratio | Min | Q1 | Median | Mean | Q3 | Max |
|---|---|---|---|---|---|---|
| Diginorm mean phred score / raw mean phred score | 62 | 66 | 74 | 74 | 79 | 89 |
| Bignorm filter time / Diginorm filter time | 24 | 28 | 31 | 33 | 38 | 46 |
| Bignorm SPAdes time / Diginorm SPAdes time | 4 | 8 | 18 | 26 | 35 | 88 |
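Assuming the row with denominator "Diginorm filter time" reports the Bignorm filter time as a percentage of the Diginorm filter time, its summary values can be recomputed from the filter times in Part I. The sketch below does this. The helper name `ratio_summary`, the "inclusive" quantile method, and rounding to whole percent are illustrative assumptions, not details given in the source.

```python
from statistics import mean, quantiles

# (Bignorm, Diginorm) filter times in seconds, copied from Part I
# in table order (Aceto ... SAR324).
filter_times = [(906, 3290), (623, 1629), (429, 1410), (240, 588),
                (1224, 5125), (653, 2124), (842, 1838), (679, 2584),
                (694, 2304), (1107, 2931), (679, 1487), (2279, 9105),
                (1222, 3706)]


def ratio_summary(pairs):
    """Summarise numerator/denominator percentages as
    [min, Q1, median, mean, Q3, max], rounded to whole percent."""
    ratios = [100 * num / den for num, den in pairs]
    q1, q2, q3 = quantiles(ratios, n=4, method="inclusive")
    stats = (min(ratios), q1, q2, mean(ratios), q3, max(ratios))
    return [round(x) for x in stats]


filter_time_summary = ratio_summary(filter_times)
```

Under these assumptions the result agrees with the tabulated row 24, 28, 31, 33, 38, 46, that is, Bignorm's filter step took roughly a quarter to a half of Diginorm's time on these datasets.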