Samuel M Nicholls, Joshua C Quick, Shuiquan Tang, Nicholas J Loman.
Abstract
BACKGROUND: Long sequencing reads are information-rich, aiding de novo assembly and reference mapping, and consequently have great potential for the study of microbial communities. However, the best approaches for analysis of long-read metagenomic data are unknown. Additionally, rigorous evaluation of bioinformatics tools is hindered by a lack of long-read data from validated samples with known composition.
Keywords: de novo assembly; Illumina; benchmark; bioinformatics; metagenomics; mock community; nanopore; real-time sequencing; single-molecule sequencing
Year: 2019 PMID: 31089679 PMCID: PMC6520541 DOI: 10.1093/gigascience/giz043
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Description of the 10 organisms comprising the ZymoBIOMICS Mock Community Standards
| Species | Type | Estimated size (Mb) | NRRL accession | ATCC accession | Sequence type | Illumina FASTQ | PacBio RSII FASTQ | PacBio Sequel FASTQ |
|---|---|---|---|---|---|---|---|---|
| Bacillus subtilis | Gram + | 4.05 | B-354 | 6633 | ST7 | ERR2935851 | SRR7498042 | SRR7415629 |
| Cryptococcus neoformans | Yeast | 18.90 | Y-2534 | 32045 | | ERR2935856 | | |
| Enterococcus faecalis | Gram + | 2.85 | B-537 | 7080 | ST55 | ERR2935850 | SRR7415622 | SRR7415630 |
| Escherichia coli | Gram − | 4.88 | B-1109 | | ST10 | ERR2935852 | SRR7498041 | |
| Lactobacillus fermentum | Gram + | 1.91 | B-1840 | 14931 | | ERR2935857 | | |
| Listeria monocytogenes | Gram + | 2.99 | B-33116 | 19117 | ST449 | ERR2935854 | SRR7415624 | SRR7415635 |
| Pseudomonas aeruginosa | Gram − | 6.79 | B-3509 | 15442 | ST252 | ERR2935853 | SRR7498043 | |
| Saccharomyces cerevisiae | Yeast | 12.10 | Y-567 | 9763 | | ERR2935855 | SRR7498048 | SRR7415638 |
| Salmonella enterica | Gram − | 4.76 | B-4212 | | ST139 | ERR2935848 | SRR7415626 | SRR7415636 |
| Staphylococcus aureus | Gram + | 2.73 | B-41012 | | ST9 | ERR2935849 | SRR7415627 | SRR7415637 |
Table adapted from ZymoBIOMICS™ Microbial Community Standard II (Log Distribution) Instruction Manual v1.1.2 Table 2 and Appendix A. The S. enterica genome is listed at Agricultural Research Service Culture Collection (NRRL) (B-4212) as Serovar Typhimurium LT2, but our genomic analysis shows it is likely to be Serotype Choleraesuis, indicating possible mis-annotation. ATCC: American Type Culture Collection.
Summary of the 4 nanopore sequencing experiments
| Signal accession | FASTQ accession | Sequencer | Standard (lot) | Time (h) | Reads (M) | N50 (kb) | Quality (median Q) | Yield (Gb) | Q>7 (Gb) |
|---|---|---|---|---|---|---|---|---|---|
| ERR2887847 | ERR3152364 | GridION | Zymo CS Even ZRC190633 | 48 | 3.49 | 5.3 | 10.3 | 14.38 | 12.39 |
| ERR2887850 | ERR3152366 | GridION | Zymo CSII Log ZRC190842 | 48 | 3.67 | 5.4 | 9.8 | 16.51 | 13.97 |
| ERR2887848, ERR2887849 | ERR3152365 | PromethION | Zymo CS Even ZRC190633 | 64 | 35.7 | 5.4 | 10.5 | 150.88 | 130.32 |
| ERR2887851, ERR2887852 | ERR3152367 | PromethION | Zymo CSII Log ZRC190842 | 64 | 34.5 | 5.4 | 10.7 | 153.31 | 133.68 |
PromethION runs were restarted following the standard 64-hour protocol. The table reflects total yield across both the standard run and subsequent restarts.
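The summary columns above can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes the "Q>7 (Gb)" column is the yield carried by reads with quality above 7, and the run labels and function names are hypothetical. Because the table values are rounded, the results are approximate.

```python
# Values copied from the nanopore summary table:
# (reads in millions, total yield in Gb, yield with Q>7 in Gb).
# The short run labels are our own, not the authors'.
RUNS = {
    "GridION Even": (3.49, 14.38, 12.39),
    "GridION Log": (3.67, 16.51, 13.97),
    "PromethION Even": (35.7, 150.88, 130.32),
    "PromethION Log": (34.5, 153.31, 133.68),
}


def q7_fraction(total_gb, q7_gb):
    """Fraction of the total yield that is reported in the Q>7 column."""
    return q7_gb / total_gb


def mean_read_length_bp(reads_millions, total_gb):
    """Mean read length in bases, from read count and total yield."""
    return (total_gb * 1e9) / (reads_millions * 1e6)


def summarise(runs):
    """Return {run: (Q>7 fraction, mean read length in bp)}."""
    return {
        name: (q7_fraction(total, q7), mean_read_length_bp(reads, total))
        for name, (reads, total, q7) in runs.items()
    }


for name, (frac, mean_len) in summarise(RUNS).items():
    print(f"{name}: {frac:.1%} of yield at Q>7, mean read length ~{mean_len:,.0f} bp")
```

The mean read length derived this way is lower than the reported N50, as expected, since N50 weights long reads by the bases they contribute.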
Figure 1. Summary plots for the 4 generated data sets: (a) collector’s curve showing sequencing yield over time for each of the 4 sequencing runs, (b) density plot showing sequence accuracy (BLAST-like identities), (c) density plot showing sequencing speed over time by sequencing experiment.
Summary statistics for Illumina sequencing data
| Dataset | Pairs (M) | Yield (Gb) | phred ≥ 30 (%) | Accession |
|---|---|---|---|---|
| Isolates | 13.53 ± 5.23 | 2.73 ± 1.06 | 87.72 ± 5.43 | See Table |
| CS (Even) | 8.8 | 2.65 | 95.12 | ERR2984773 |
| CSII (Log) | 47.8 | 9.66 | 95.71 | ERR2935805 |
Illumina sequencing was performed on an Illumina HiSeq 1500, with the exception of the Even community, which was sequenced on an Illumina MiSeq.
Figure 2. Proportion of sequenced bases assigned by minimap2 to each of the 10 organisms that were sequenced (x-axis), against the proportion of yield expected given the known composition (y-axis) of the Zymo CSII (Log) standard.
Read alignment statistics for Even samples, showing absolute measurements and proportion of sequencing yield and the estimated genome coverage obtained for each organism in the mock community
| Species | Expected proportion (%) | GridION yield (Gb) | GridION measured proportion (%) | GridION alignment N50 (kb) | GridION coverage (×) | PromethION yield (Gb) | PromethION measured proportion (%) | PromethION alignment N50 (kb) | PromethION coverage (×) |
|---|---|---|---|---|---|---|---|---|---|
| Bacillus subtilis | 12 | 2.12 | 19.32 | 4.30 | 524.51 | 21.55 | 19.02 | 4.40 | 5,326.44 |
| Listeria monocytogenes | 12 | 1.60 | 14.56 | 4.47 | 534.26 | 16.23 | 14.33 | 4.58 | 5,424.46 |
| Enterococcus faecalis | 12 | 1.34 | 12.24 | 4.45 | 472.47 | 13.67 | 12.07 | 4.57 | 4,805.60 |
| Staphylococcus aureus | 12 | 1.24 | 11.28 | 4.47 | 453.84 | 12.59 | 11.11 | 4.59 | 4,611.61 |
| Salmonella enterica | 12 | 1.10 | 9.99 | 8.55 | 230.51 | 11.69 | 10.32 | 8.95 | 2,456.19 |
| Escherichia coli | 12 | 1.09 | 9.93 | 8.31 | 223.59 | 11.62 | 10.26 | 8.71 | 2,382.59 |
| Pseudomonas aeruginosa | 12 | 1.07 | 9.70 | 8.98 | 156.85 | 11.45 | 10.11 | 9.38 | 1,686.34 |
| Lactobacillus fermentum | 12 | 1.02 | 9.28 | 3.62 | 534.73 | 10.34 | 9.13 | 3.73 | 5,425.69 |
| Saccharomyces cerevisiae | 2 | 0.21 | 1.92 | 4.09 | 17.46 | 2.12 | 1.87 | 4.18 | 175.23 |
| Cryptococcus neoformans | 2 | 0.20 | 1.78 | 4.45 | 10.37 | 2.00 | 1.77 | 4.54 | 105.82 |
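The yield and coverage columns are linked through genome size. The sketch below assumes, and the source does not state this explicitly, that coverage was estimated as aligned yield divided by the estimated genome size. Under that assumption each row's yield and coverage imply a genome size, which can be matched against the estimated sizes in the organism table. Function names are hypothetical, and the inputs are the rounded PromethION values from the table.

```python
# Estimated genome sizes (Mb), in the row order of the organism table.
GENOME_SIZES_MB = [4.05, 18.90, 2.85, 4.88, 1.91, 2.99, 6.79, 12.10, 4.76, 2.73]

# (yield in Gb, coverage) for the PromethION Even columns, in table row order.
PROMETHION_EVEN = [
    (21.55, 5326.44), (16.23, 5424.46), (13.67, 4805.60), (12.59, 4611.61),
    (11.69, 2456.19), (11.62, 2382.59), (11.45, 1686.34), (10.34, 5425.69),
    (2.12, 175.23), (2.00, 105.82),
]


def implied_genome_size_mb(yield_gb, coverage):
    """Genome size (Mb) implied by coverage = yield / genome size."""
    return yield_gb * 1000.0 / coverage


def nearest_listed_size(size_mb, listed=GENOME_SIZES_MB):
    """Closest estimated genome size from the organism table."""
    return min(listed, key=lambda s: abs(s - size_mb))


matches = [
    nearest_listed_size(implied_genome_size_mb(y, c)) for y, c in PROMETHION_EVEN
]

for (y, c), size in zip(PROMETHION_EVEN, matches):
    print(f"yield {y:6.2f} Gb, coverage {c:8.2f}x -> "
          f"implied {implied_genome_size_mb(y, c):6.2f} Mb, nearest listed {size:.2f} Mb")
```

With these inputs every row maps to a different listed genome size, which is consistent with the assumed relationship between yield and coverage.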
Read alignment statistics for Log samples, describing sequencing yield and estimated genome coverage obtained for each organism in the mock community
| Species | GridION yield (Gb) | GridION alignment N50 (kb) | GridION coverage (×) | PromethION yield (Gb) | PromethION alignment N50 (kb) | PromethION coverage (×) |
|---|---|---|---|---|---|---|
| Listeria monocytogenes | 12.10 | 4.95 | 4,043.90 | 110.09 | 4.97 | 36,796.21 |
| Pseudomonas aeruginosa | 1.10 | 9.38 | 161.45 | 9.99 | 9.33 | 1,471.41 |
| Bacillus subtilis | 0.16 | 5.03 | 38.67 | 1.44 | 5.04 | 356.00 |
| Saccharomyces cerevisiae | 0.08 | 4.78 | 6.93 | 0.75 | 4.75 | 62.33 |
| Salmonella enterica | 0.01 | 9.20 | 2.20 | 0.10 | 9.17 | 20.04 |
| Escherichia coli | 0.01 | 8.65 | 2.14 | 0.09 | 9.17 | 19.24 |
| Lactobacillus fermentum | 4E−4 | 3.40 | 0.210 | 0.004 | 3.37 | 2.03 |
| Enterococcus faecalis | 2E−4 | 7.62 | 0.055 | 1E−3 | 6.05 | 0.34 |
| Cryptococcus neoformans | 6E−5 | 4.41 | 0.003 | 7E−4 | 4.97 | 0.037 |
| Staphylococcus aureus | 1E−5 | 7.12 | 0.005 | 5E−5 | 3.58 | 0.020 |
Note that expected and measured proportions are illustrated in Fig. 2.
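A measured proportion of the kind plotted in Fig. 2 can be approximated from the yield column alone. The sketch below assumes each organism's measured proportion is its aligned yield divided by the total aligned yield; the exact procedure the authors used is not given here, and the function name is hypothetical. The inputs are the rounded GridION yields from the Log table, so the smallest proportions are only rough.

```python
# GridION aligned yield (Gb) per organism, in the row order of the Log table.
GRIDION_LOG_YIELD_GB = [12.10, 1.10, 0.16, 0.08, 0.01, 0.01, 4e-4, 2e-4, 6e-5, 1e-5]


def measured_proportions(yields_gb):
    """Share of the total aligned yield attributed to each organism."""
    total = sum(yields_gb)
    return [y / total for y in yields_gb]


proportions = measured_proportions(GRIDION_LOG_YIELD_GB)

for rank, p in enumerate(proportions, start=1):
    # Scientific notation keeps the low-abundance rows readable.
    print(f"row {rank:2d}: {p:.3e}")
```

Because the Log standard spans several orders of magnitude in abundance, comparing these values on a logarithmic scale is more informative than comparing raw percentages.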
Figure 3. Bar plots demonstrating total length and contiguity of genomic assemblies obtained with wtdbg2 from each of the long-read nanopore data sets. For each organism in the community (coloured columns), contigs longer than 10 kb are horizontally stacked along the x-axis. Each row represents a run of wtdbg2, with the parameters for edge support, read length threshold, and homopolymer-compressed k-mer size labelled on the left. Assemblies are grouped by the data set on which they were run (row facets). Additionally, assemblies may be compared to the estimated true genome size, the available McIntyre et al. [17] PacBio assemblies, and the per-isolate Illumina SPAdes assembly. Estimated genome sizes are the same as those found in Table 1; however, to display approximate chromosomes, the 2 yeasts were replaced by their corresponding canonical National Center for Biotechnology Information references for visualization purposes only. The C. neoformans strain used by the Zymo standards is a diploid genetic cross, which may explain the larger assemblies compared to the estimated haploid size shown.
Sequence identity dotplots and CheckM genome completeness scores for each of the 7 bacterial species for which there was a corresponding PacBio assembly from McIntyre et al. [17]
Four wtdbg2 assembly conditions are represented, varying the homopolymer-compressed k-mer parameter “p” and the graph minimum edge weight threshold “e.” The read length threshold “L” was fixed at 5,000 bp. The left and right halves of the table correspond to the same assembly condition for the GridION and 25% PromethION sequencing data, respectively. The L50/L95 refers to the number of assembled contigs required to span ≥50% and ≥95% of the estimated genome size (see Table 1). A minus sign indicates that the set of assembled contigs assigned to a taxon was not of sufficient total length to cover 95% of the estimated size. CheckM genome completeness scores are expressed as a percentage and were calculated per organism at the end of each polishing phase. bs: B. subtilis; ef: E. faecalis; ec: E. coli; lm: L. monocytogenes; pa: P. aeruginosa; se: S. enterica; sa: S. aureus.
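The L50/L95 definition above can be sketched as a small function. This is a minimal illustration, not the authors' code: it assumes contigs are counted from longest to shortest (the usual convention for this kind of statistic), returns `None` where the table would show a minus sign, and uses made-up contig lengths.

```python
from typing import Iterable, Optional


def lx(contig_lengths: Iterable[int], genome_size: int, fraction: float) -> Optional[int]:
    """Number of contigs, longest first, needed to span `fraction` of the genome.

    Returns None when the contigs together are too short to reach the target,
    which corresponds to the minus sign described in the table note.
    """
    target = fraction * genome_size
    covered = 0
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        covered += length
        if covered >= target:
            return count
    return None


# Hypothetical contig lengths (bp) for a genome with an estimated size of 4 Mb.
example_contigs = [400_000, 2_500_000, 1_000_000]
estimated_size = 4_000_000

l50 = lx(example_contigs, estimated_size, 0.50)
l95 = lx(example_contigs, estimated_size, 0.95)
print(f"L50 = {l50}, L95 = {l95}")
```

Measuring against the estimated genome size rather than the assembly's own total length means an incomplete assembly cannot score well simply by being small.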