Esther Singer, Bill Andreopoulos, Robert M Bowers, Janey Lee, Shweta Deshpande, Jennifer Chiniquy, Doina Ciobanu, Hans-Peter Klenk, Matthew Zane, Christopher Daum, Alicia Clum, Jan-Fang Cheng, Alex Copeland, Tanja Woyke.
Abstract
Generating sequence data of a defined community composed of organisms with complete reference genomes is indispensable for benchmarking new genome sequence analysis methods, including assembly and binning tools. Moreover, validating new sequencing library protocols and platforms, in order to assess critical components such as sequencing errors and biases, relies on such datasets. Here we report next-generation metagenomic sequence data of a defined mock community (Mock Bacteria ARchaea Community; MBARC-26), composed of 23 bacterial and 3 archaeal strains with finished genomes. These strains span 10 phyla and 14 classes, cover a range of GC contents, genome sizes, and repeat content, and encompass a diverse abundance profile. Short-read Illumina and long-read PacBio SMRT sequences of this mock community are described. These data represent a valuable resource for the scientific community, enabling extensive benchmarking and comparative evaluation of bioinformatics tools without the need to simulate data. As such, these data can aid in improving our current sequence data analysis toolkit and spur interest in the development of new tools.
Year: 2016 PMID: 27673566 PMCID: PMC5037974 DOI: 10.1038/sdata.2016.81
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Genome statistics of each mock community member.
| Strain | Isolation source | NCBI accession | Genome size (bp) | GC content (%) | Repeat content | | |
|---|---|---|---|---|---|---|---|
| | Soil | NC_018014 | 5227858 | 60.3 | 18.3 | 1 | 2 |
| | Sewage | NC_003450 | 3309401 | 53.8 | NA* | 1 | 6 |
| | Soil | NC_014211 | 6543312 | 72.7 | 0.2 | 2 | 5 |
| | Human gingival crevice | NC_014363 | 2051896 | 64.7 | 0.46 | 1 | 1 |
| | Human sputum | NC_014168 | 3157527 | 66.8 | 0.92 | 1 | 1 |
| | Seawater collected in a mussel farm | NC_019904 | 5608040 | 44.8 | 4.34 | 1 | 4 |
| | Hot spring (50 °C) | NC_014212 | 3721669 | 62.7 | 6.54 | 3 | 2 |
| | Bovine | NC_008261 | 3256683 | 28.4 | 2.02 | 1 | 20 |
| | Various | NC_009012 | 3843301 | 39 | 7.51 | 1 | 4 |
| | Pond sediment | NC_018068 | 4991181 | 42.1 | 4.08 | 3 | 9 |
| | Aquifer groundwater | NC_018515 | 4873567 | 41.8 | 2.89 | 1 | 11 |
| | Freshwater mud | NC_021184 | 4855529 | 45.5 | 5.99 | 1 | 8 |
| | Infected wound | NC_002737 | 1852441 | 38.5 | NA* | 1 | 6 |
| | Composting reactor | NC_019897 | 4355525 | 60.1 | 7.14 | 2 | 5 |
| | Human stool | NC_000913 | 4639675 | 50.8 | 6.7 | 1 | 7 |
| | | NC_017033 | 3603458 | 63.4 | 1.32 | 1 | 4 |
| | Brackish water | NC_012982 | 3540114 | 45.2 | 0.45 | 2 | 2 |
| | Cr-contaminated aquifer | NC_019936 | 4600489 | 62.5 | 1.83 | 4 | 4 |
| | African frog | NC_015761 | 4460105 | 51.3 | 2.36 | 1 | 7 |
| | Animal tissue | NC_010067 | 4600800 | 51.4 | 2.42 | 1 | 7 |
| | Oil field | NC_014364 | 4653970 | 49 | 2.01 | 1 | 2 |
| | Hot mud of spa | NC_017095 | 2166381 | 39 | 4.04 | 1 | 2 |
| | Seawater | NC_014008 | 3750771 | 53.6 | 1.07 | 1 | 2 |
| | Saline lake | CP003050.1 | 3223876 | 64.3 | NA* | 1 | 2 |
| | Solar saltworks | NC_019792.1 | 3788356 | 62.2 | 4.22 | 1 | 3 |
| | Lake | NC_019974.1 | 4314118 | 64.7 | 0.91 | 3 | 4 |

Genome size includes chromosomes and plasmids. All genomes are available as finished sequences. Phylum associations for each strain are abbreviated as follows: AD, Acidobacteria; AT, Actinobacteria; B, Bacteroidetes; D, Deinococcus-Thermus; E, Euryarchaeota; F, Firmicutes; P, Proteobacteria; S, Spirochaetes; T, Thermotogae; V, Verrucomicrobia. Isolation sources were obtained from literature on the respective strains, where available. GC content is based on genome size. Genomes without NCBI repeat region annotation are denoted with an asterisk (*).
Figure 1. Characteristics of the MBARC-26 community.
Community members display diversity in phylogenetic distribution and relatedness (a), genome size (b), GC content (c), and repeat content normalized by genome size (d). Shades of the same color in (a) denote the same phylum association: Green—Proteobacteria, blue—Actinobacteria, purple—Firmicutes, yellow—Euryarchaeota.
Sequence statistics by sequencing platform.
| Platform | Illumina | PacBio |
|---|---|---|
| Model | HiSeq-HO 2000 | RS II |
| Library chemistry | TruSeq paired-end cluster kit v3 | SMRTbell template preparation kit |
| Sequencing chemistry | TruSeq SBS sequencing kit 200 cycles v3 | P4C2 |
| Run mode | 2x150 | 1x120 min |
| # of raw reads | 355,875,608 | 300,584 |
| # of filtered reads | 347,963,988 | 53,654 |
| Average insert size [bp] | 219±43 | 1,041±576 |
| Average quality score (filtered reads) | Read 1: 33.47, Read 2: 32.04 | 0.976 |
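
The raw and filtered read counts in the table above imply very different retention rates for the two platforms. The following minimal sketch computes the fraction of raw reads that pass filtering, using only the counts from the table. The helper name `retained_fraction` and the dictionary layout are illustrative choices, not part of the original dataset description, and the two platforms' filtering criteria may not be directly comparable.

```python
# Read counts copied from the sequence statistics table above.
# The dictionary layout is an illustrative assumption.
READ_COUNTS = {
    "Illumina": {"raw": 355_875_608, "filtered": 347_963_988},
    "PacBio": {"raw": 300_584, "filtered": 53_654},
}


def retained_fraction(raw: int, filtered: int) -> float:
    """Return the fraction of raw reads that survive filtering."""
    if raw <= 0:
        raise ValueError("raw read count must be positive")
    if not 0 <= filtered <= raw:
        raise ValueError("filtered count must lie between 0 and raw")
    return filtered / raw


if __name__ == "__main__":
    for platform, counts in READ_COUNTS.items():
        frac = retained_fraction(counts["raw"], counts["filtered"])
        print(f"{platform}: {frac:.1%} of raw reads retained")
```

By this calculation roughly 98% of Illumina reads and roughly 18% of PacBio reads are retained after filtering.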
Figure 2. MBARC-26 community composition and relative abundance distribution, based on Illumina and PacBio read mapping and mean DNA molarity.
Mock community members are grouped and arranged in order of % mapped sequences (Illumina). The observed discrepancy between molarity and % mapped PacBio and Illumina sequences in T. composti is likely due to contamination, as T. composti was previously found to occur as a laboratory contaminant in various shotgun metagenome datasets (unpublished data). The smaller discrepancies are expected due to DNA quantification spreads and platform biases. Colors denote phylum association as defined in Fig. 1.
Figure 3. Quantitative comparison of MBARC-26 Illumina and PacBio shotgun sequence datasets.
(a) Community representation according to % mapped sequences for each mock community member in the PacBio (x-axis) and Illumina (y-axis) shotgun sequence datasets. (b) Percent chromosome coverage and fold coverage of each mock community genome by sequencing platform using unassembled sequences. Colors denote phylum association as defined in Fig. 1.
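
Panel (b) reports two distinct quantities: percent chromosome coverage (the share of positions touched by at least one read) and fold coverage (mapped bases divided by genome length). The sketch below shows one common way to derive both from read alignment intervals. It is an illustration under stated assumptions, not the procedure used to produce the figure: the function name `coverage_stats`, the half-open interval convention, and the toy coordinates are all hypothetical.

```python
from typing import Iterable, Tuple

# Half-open [start, end) alignment coordinates on a single reference sequence.
Interval = Tuple[int, int]


def coverage_stats(alignments: Iterable[Interval], genome_length: int) -> Tuple[float, float]:
    """Return (percent of positions covered, mean fold coverage)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")

    # Drop empty intervals and sort by start so overlaps can be merged in one pass.
    intervals = sorted((s, e) for s, e in alignments if e > s)

    # Fold coverage counts every aligned base, including overlapping ones.
    total_bases = sum(e - s for s, e in intervals)

    # Percent coverage counts each reference position at most once.
    covered = 0
    current_start = current_end = None
    for start, end in intervals:
        if current_end is None or start > current_end:
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start

    return 100.0 * covered / genome_length, total_bases / genome_length


if __name__ == "__main__":
    # Hypothetical toy data: three reads on a 1,000 bp reference.
    toy_reads = [(0, 150), (100, 250), (600, 750)]
    percent, fold = coverage_stats(toy_reads, genome_length=1000)
    print(f"percent covered: {percent:.1f}, fold coverage: {fold:.2f}")
```

The two measures can diverge: a genome with deep coverage concentrated in a few regions has high fold coverage but low percent coverage, which is why the figure reports them separately.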