Adrian Fritz, Peter Hofmann, Stephan Majda, Eik Dahms, Johannes Dröge, Jessika Fiedler, Till R Lesker, Peter Belmann, Matthew Z DeMaere, Aaron E Darling, Alexander Sczyrba, Andreas Bremges, Alice C McHardy.
Abstract
BACKGROUND: Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required.
Keywords: Benchmarking; CAMI; Genome binning; Metagenome assembly; Metagenomics software; Microbial community; Simulation; Taxonomic binning; Taxonomic profiling
Year: 2019 PMID: 30736849 PMCID: PMC6368784 DOI: 10.1186/s40168-019-0633-6
Source DB: PubMed Journal: Microbiome ISSN: 2049-2618 Impact factor: 14.650
Fig. 1 UML diagram of the CAMISIM workflow. CAMISIM starts with the “community design” step, which can either be de novo, requiring a taxon mapping file and reference genomes, or be based on a taxonomic profile. This step produces a community genome and taxon profile, which is used for the metagenome simulation with one of currently four read simulators (ART, wgsim, PBsim, NanoSim). The resulting reads and the BAM files mapping the reads to the original genomes are used to create the gold standards, before all files can be anonymized and shuffled in the post-processing step.
Properties of popular metagenome sequence simulators
| Software | De novo | Profile | Multi-sample | Strains | Non-Illumina data | License | Last updated |
|---|---|---|---|---|---|---|---|
| MetaSim | ✓ | X | X | ✓ | 454 | P, AU | 03/2009 |
| iMESS | ✓ | X | X | X | 454 | – | 07/2014 |
| BBMap | ✓ | X | X | X | – | LBL | 01/2019 |
| NeSSM | ✓ | ✓ | X | X | 454 | AU | 07/2013 |
| BEAR | ✓ | ✓ | X | X | – | AU | 02/2017 |
| FASTQSim | ✓ | ✓ | X | X | SOLiD, IonTorrent, PacBio | GPL | 05/2015 |
| Grinder | ✓ | ✓ | ✓ | X | Sanger, 454 | GPL | 04/2016 |
| CAMISIM | ✓ | ✓ | ✓ | ✓ | PacBio, ONT, … | Apache 2.0 | 01/2019 |
Abbreviations: P, proprietary software; AU, academic use only; LBL, Lawrence Berkeley Lab
The table shows whether the simulator can generate an abundance profile de novo and whether an existing profile of a microbial community can be used as input. Further inspected features are the ability to simulate multi-sample data sets, strains, and non-Illumina data (e.g., long reads). Lastly, the table states whether and how the software is licensed and the date it was last updated.
Fig. 2 Assembly graphs become more complex as coverage increases. MEGAHIT assembly graphs (k = 41) of an E. coli K12 genome for 2×, 32×, and 512× per-base coverage, respectively, visualized with Bandage [60]. At 2× coverage, the graph is disconnected and the assembly is therefore fragmented. With increasing coverage, more unitigs can be joined, yielding a decent assembly at 32× coverage. At 512× coverage, however, sequencing errors add erroneous edges to the graph and the assembly is fragmented again.
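To make the coverage levels in this caption concrete, the sketch below converts a target per-base coverage into a read count using the standard relation coverage = (number of reads × read length) / genome length. The genome length (approximately that of the E. coli K-12 MG1655 chromosome) and the 150 bp read length are assumptions for illustration, not values taken from the paper, and `reads_for_coverage` is a hypothetical helper, not part of CAMISIM.

```python
import math


def reads_for_coverage(genome_length: int, coverage: float, read_length: int) -> int:
    """Return the number of reads needed to reach the target per-base coverage.

    Uses coverage = reads * read_length / genome_length, rounded up.
    """
    return math.ceil(coverage * genome_length / read_length)


# Assumed values, for illustration only.
GENOME_LENGTH = 4_641_652  # approx. E. coli K-12 MG1655 chromosome length in bp
READ_LENGTH = 150          # hypothetical read length in bp

if __name__ == "__main__":
    for cov in (2, 32, 512):
        n_reads = reads_for_coverage(GENOME_LENGTH, cov, READ_LENGTH)
        print(f"{cov}x coverage: {n_reads} reads")
```

Because the read count scales linearly with coverage, the 512× data set contains 256 times as many reads as the 2× data set, and with them proportionally more error-containing reads that can introduce spurious edges.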
Fig. 3 Coverage-dependent assembly performance for MEGAHIT and metaSPAdes. Shown are, from top to bottom, genome fraction in %, number of contigs, and NGA50 (as reported by QUAST [61]) for 0%, 2%, and 5% uniform error rates and for the ART CAMI error profile, compared to the best possible metrics (gold standard) on the ART CAMI profile (dashed black).
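As a reminder of what the NGA50 metric measures, here is a minimal sketch of the NG50-style calculation: the largest block length such that all blocks at least that long together cover at least half of the reference genome. For NGA50, the blocks are aligned blocks (contigs split at misassemblies) rather than whole contigs. The function name `ng50` and the choice to return 0 when half the genome is never reached are assumptions of this sketch, not QUAST's implementation.

```python
def ng50(block_lengths, genome_length):
    """Largest length L such that blocks of length >= L sum to >= 50% of genome_length.

    Pass contig lengths for NG50, or aligned block lengths for NGA50.
    Returns 0 if the blocks never cover half of the genome (a sketch convention).
    """
    total = 0
    for length in sorted(block_lengths, reverse=True):
        total += length
        if 2 * total >= genome_length:
            return length
    return 0


if __name__ == "__main__":
    # Hypothetical aligned block lengths for a 100 bp reference.
    print(ng50([40, 30, 20, 10], 100))
```

A fragmented assembly consists of many short blocks, so more of them are needed to reach half the genome and the resulting value drops.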
Fig. 4 Genome fraction calculated using either unique mappings or, in case of ties, multiple best mappings to the community genome collection. Left: genome fraction for the E. coli assembly created by MEGAHIT from error-free reads (top) and with the ART CAMI error profile (bottom). Right: average genome fraction and standard deviation for all 152 original iTOL genomes, assembled by MEGAHIT from error-free reads (top) and with the ART CAMI error profile (bottom). Error bars denote 1× standard deviation.
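Genome fraction is the percentage of reference bases covered by at least one aligned contig. The sketch below computes it from a list of alignment intervals by merging overlaps, so that bases hit more than once are counted once. It assumes half-open `(start, end)` coordinates on a single reference sequence; which alignments enter the list (only unique best mappings, or all tied best mappings) is the choice the caption compares and is left to the caller. `genome_fraction` is a hypothetical helper, not QUAST code.

```python
def genome_fraction(alignments, genome_length):
    """Percentage of reference bases covered by at least one alignment.

    alignments: iterable of half-open (start, end) intervals on the reference.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(alignments):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current one.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return 100.0 * covered / genome_length


if __name__ == "__main__":
    # Hypothetical alignments on a 100 bp reference.
    print(genome_fraction([(0, 50), (40, 80)], 100))
```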
Fig. 5 Comparison of CAMISIM and PICRUSt functional profiles for different body sites. (a) NMDS ordination of the functional predictions of individual samples by the different methods. The body sites are color-coded and labeled with their sample number. The original WGS samples are denoted by squares, the CAMISIM results by circles, and the PICRUSt results by triangles. (b) Mean and standard deviation of the Pearson and Spearman correlations to the original WGS samples per body site. C, CAMISIM; P, PICRUSt.
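The two correlation measures in panel b can be sketched in a few lines: Pearson correlation on the raw abundance values and Spearman correlation as Pearson correlation on their ranks, with tied values receiving their average rank. This is a minimal standard-library sketch, not the authors' analysis code; the function names and the example profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical functional abundance profiles for one sample.
    original = [0.40, 0.30, 0.20, 0.10]
    predicted = [0.50, 0.25, 0.15, 0.10]
    print(pearson(original, predicted), spearman(original, predicted))
```

Spearman correlation depends only on the ordering of the functional categories, so it is less sensitive than Pearson correlation to a few highly abundant categories dominating the comparison.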