
nf-core/mag: a best-practice pipeline for metagenome hybrid assembly and binning.

Sabrina Krakau, Daniel Straub, Hadrien Gourlé, Gisela Gabernet, Sven Nahnsen.

Abstract

The analysis of shotgun metagenomic data provides valuable insights into microbial communities, while allowing resolution at the individual genome level. In the absence of complete reference genomes, this requires the reconstruction of metagenome assembled genomes (MAGs) from sequencing reads. We present the nf-core/mag pipeline for metagenome assembly, binning and taxonomic classification. It can optionally combine short and long reads to increase assembly continuity and utilize sample-wise group information for co-assembly and genome binning. The pipeline is easy to install (all dependencies are provided within containers), portable and reproducible. It is written in Nextflow and developed as part of the nf-core initiative for best-practice pipeline development. All code is hosted on GitHub under the nf-core organization (https://github.com/nf-core/mag) and released under the MIT license.
© The Author(s) 2022. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.


Year:  2022        PMID: 35118380      PMCID: PMC8808542          DOI: 10.1093/nargab/lqac007

Source DB:  PubMed          Journal:  NAR Genom Bioinform        ISSN: 2631-9268


INTRODUCTION

Shotgun metagenomic approaches enable genomic analyses of all microbes within, for example, environmental or host-associated microbiome communities. Since most bacteria cannot be cultured, isolated and individually sequenced, reference databases for microbial genomes are often incomplete. Thus, one of the main tasks in metagenomic data analysis is to reconstruct the individual genomes directly from the given mixture of metagenomic reads. A typical reference-independent metagenomic workflow consists of preprocessing raw reads, assembly and binning to generate so-called metagenome assembled genomes (MAGs), as well as taxonomic and functional annotation of MAGs. Assemblies based on short reads typically suffer from being highly fragmented. In contrast, long reads can be used to generate continuous assemblies but suffer from high error rates. Hybrid assembly approaches combine the advantages of both short and long reads and thus can produce continuous and accurate assemblies at reasonable cost (1). Another important aspect is whether to combine information across samples for assembly. When analyzing multiple samples that contain the same microbes (e.g. cultures and time series), co-assembly increases sequencing depth and can thus improve assembly completeness, in particular with respect to low-abundance genomes. The sample-wise sequencing depth information can be used to aid binning methods in deciding which contigs should form a MAG. Co-assembly also allows MAGs to be tracked directly across samples instead of computing links between MAGs of multiple assemblies. On the other hand, co-assembly can increase the metagenome complexity and result in more fragmentation or hybrid contigs of multiple similar genomes (2,3). Several pipelines have been developed for the assembly and binning of metagenomes (4–6).
However, only a few pipelines such as Muffin (7) and ATLAS (8) make use of workflow management systems, such as Snakemake (9) or Nextflow (10), which facilitate scalability, portability, reproducibility and ease of application. These pipelines have different strengths and weaknesses, but only Muffin supports hybrid assembly and none of them supports co-assembly. Here we introduce nf-core/mag, a Nextflow pipeline for hybrid assembly of metagenomes, binning and taxonomic classification of MAGs. The nf-core/mag pipeline is ideal for standardized, large-scale and high-throughput analysis. It is also versatile: it can use mixed sequencing data (hybrid assembly) or data from a single sequencing technology, and allows the use of group information to perform co-assembly and to compute co-abundances used for genome binning. The pipeline is part of the nf-core collection of community curated best-practice pipelines (11).

MATERIALS AND METHODS

Implementation and reproducibility

nf-core/mag is written in Nextflow, making use of the new DSL2 syntax (see Supplementary Data: Section S2). DSL2 enables a modularized pipeline structure, where each individual process, ideally containing only one tool, is provided as a ‘module’. Additionally, sub-workflows can be integrated. The pipeline strongly benefits from the nf-core framework, which enforces a set of strict best-practice guidelines to ensure high-quality pipeline development and maintenance (11). For example, pipelines must provide comprehensive documentation as well as community support via GitHub Issues and dedicated Slack channels. Pipeline portability and reproducibility are enabled through (i) pipeline versioning (i.e. tagged releases on GitHub), (ii) building and archiving associated containers that contain the required software dependencies (i.e. the exact same compute environment can be used over time and across systems) and (iii) detailed reporting of the used pipeline/software versions and applied parameters. The use of container technologies such as Docker and Singularity enables reproducibility and portability across different compute systems, i.e. local computers, HPC clusters and cloud platforms. The nf-core/mag pipeline comes with a small test dataset that is used for continuous integration (CI) testing with GitHub Actions. In addition, ‘full-size’ pipeline tests are run on AWS for each pipeline release to ensure cloud compatibility and error-free operation on real-world datasets. The full-size test results for each pipeline release are displayed on the nf-core website (https://nf-co.re/mag/results). Moreover, since the nf-core framework uses DSL2, commonly used processes can be shared across pipelines via nf-core/modules (https://github.com/nf-core/modules). This allows the efficient integration of new analysis tools into the pipeline in the future.
Although the nf-core framework facilitates reproducibility at several layers, it is also crucial to ensure that the individual tools that are part of the pipeline can be run in a deterministic and reproducible manner. For this purpose, nf-core/mag offers dedicated reproducibility settings, for example, to set a random seed parameter, to fix and report multi-threading parameters, and to generate and/or save required databases, whose public versions do not always remain accessible (see Supplementary Data: Section S3).
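The value of a fixed random seed can be illustrated with a generic sketch (this is not pipeline code; `subsample` is a hypothetical stand-in for any seeded step): the same seed always reproduces the same result, which is exactly the property the pipeline's seed parameter pins down.

```python
import random

def subsample(items, k, seed):
    """Deterministically subsample k items: a fixed seed always yields the
    same selection, mirroring how a pipeline-level seed parameter makes a
    stochastic tool reproducible across runs."""
    rng = random.Random(seed)
    return rng.sample(items, k)

a = subsample(list(range(100)), 5, seed=42)
b = subsample(list(range(100)), 5, seed=42)
print(a == b)  # True: identical runs with a fixed seed
```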

Simulation of metagenomic data

To show exemplary results generated with the nf-core/mag pipeline, we simulated metagenomic time series data with CAMISIM (12). For this, CAMISIM was applied based on the genome sources from the ‘CAMI II challenge toy mouse gut dataset’ (13), containing 791 genomes, while set to generate Illumina and Nanopore reads. Two groups of samples were simulated, each comprising a time series of four samples (for details see Supplementary Data: Section S4). The simulated datasets as well as a sample sheet file, which can be used as input for the nf-core/mag pipeline, are available at https://doi.org/10.5281/zenodo.5155395.

RESULTS

Pipeline overview

An overview of the nf-core/mag pipeline is shown in Figure 1A. The input can be either FASTQ files containing the short reads, provided directly, or a sample sheet in CSV format containing the paths to short-read and, optionally, long-read files as well as additional group information.
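Such a sample sheet could be assembled programmatically as sketched below. The column names used here (`sample`, `group`, `short_reads_1`, `short_reads_2`, `long_reads`) are an assumption based on the pipeline's documented CSV input and should be checked against the nf-core/mag usage documentation for the release in use.

```python
import csv
import io

# Assumed column layout of an nf-core/mag sample sheet; verify against the
# usage docs of the pipeline release you run.
FIELDS = ["sample", "group", "short_reads_1", "short_reads_2", "long_reads"]

def write_samplesheet(rows, handle):
    """Write per-sample rows (dicts keyed by FIELDS) as a CSV sample sheet."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

buf = io.StringIO()
write_samplesheet(
    [
        {"sample": "s1_t0", "group": "g1",
         "short_reads_1": "s1_t0_R1.fastq.gz",
         "short_reads_2": "s1_t0_R2.fastq.gz",
         "long_reads": "s1_t0_ont.fastq.gz"},
        {"sample": "s1_t1", "group": "g1",
         "short_reads_1": "s1_t1_R1.fastq.gz",
         "short_reads_2": "s1_t1_R2.fastq.gz",
         "long_reads": ""},  # long reads are optional per sample
    ],
    buf,
)
print(buf.getvalue().splitlines()[0])  # header row
```

The shared `group` value is what would let the pipeline co-assemble both time points of this series.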
Figure 1.

(A) Overview of the nf-core/mag pipeline (v2.1.0). (B) Clustered heatmap showing MAG abundances, i.e. centered log-ratio depths across samples. (C) Schematic representation of MAG summary output, containing abundance information, QC metrics and taxonomic classifications.

Pre-processing

The pipeline starts with preprocessing the raw reads. For short Illumina reads, fastp (14) is used for adapter and quality trimming, Bowtie2 (15) for identifying and removing host or PhiX reads, and FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for quality control (QC) on the raw and preprocessed reads. For long Nanopore reads, porechop (https://github.com/rrwick/Porechop) is used for adapter trimming, NanoLyse (https://github.com/wdecoster/nanolyse) to remove phage lambda control contamination, and Filtlong (16) for quality filtering (host read contamination is indirectly removed based on the filtered short reads). QC on the raw and processed long reads is performed using NanoPlot (https://github.com/wdecoster/NanoPlot).

Assembly and binning

The preprocessed reads are then de novo assembled using MEGAHIT (17) or SPAdes (18). If both short and long reads are provided, a hybrid assembly can be performed using hybridSPAdes (19). By default, nf-core/mag assembles the reads of each sample individually. However, it provides the option to compute co-assemblies according to user-specified group information. MetaBAT2 (20) is then used to bin the contigs into individual MAGs based on nucleotide frequencies and co-abundance patterns across samples (within the same group, by default). The pipeline further estimates MAG abundances for the different samples from contig sequencing depths. QUAST (21) summarizes QC features of the generated assemblies and MAGs. MAG completeness and contamination are estimated by BUSCO (22), which makes use of near-universal single-copy orthologs.
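The idea of deriving a MAG's abundance from its contigs' sequencing depths can be sketched as a length-weighted mean; this is an illustrative formula, and the pipeline's exact computation may differ.

```python
def mag_depth(contigs):
    """Length-weighted mean sequencing depth over a MAG's contigs.

    contigs: iterable of (length_bp, mean_depth) pairs. Weighting by contig
    length keeps long contigs from being drowned out by many short ones.
    Illustrative sketch, not nf-core/mag's exact formula.
    """
    total_bp = sum(length for length, _ in contigs)
    if total_bp == 0:
        return 0.0
    return sum(length * depth for length, depth in contigs) / total_bp

# Two contigs: 10 kb at 20x and 30 kb at 10x -> (10*20 + 30*10) / 40 = 12.5x
print(mag_depth([(10_000, 20.0), (30_000, 10.0)]))  # 12.5
```

Computing this per sample yields the depth profile that binning (and the abundance heatmap in Figure 1B) builds on.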

Taxonomic classification

Finally, MAGs are taxonomically annotated using GTDB-Tk (23) or CAT/BAT (24). While CAT/BAT is able to taxonomically classify any MAG, GTDB-Tk requires a number of marker genes and is therefore only applied to MAGs passing quality thresholds on the completeness and contamination estimates previously computed with BUSCO. Besides the results from the individual tools, nf-core/mag outputs a summary containing estimated abundances, as well as the main QUAST, BUSCO and GTDB-Tk metrics for each MAG (see Figure 1C).
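The combined summary output can be pictured as a join of per-MAG metrics from the individual tools. The field names below (`depth`, `n50`, `completeness`, `contamination`, `taxonomy`) are illustrative, not the pipeline's exact column headers.

```python
def summarize(abundances, quast, busco, gtdbtk):
    """Join per-MAG metrics (each a dict keyed by MAG id) into summary rows,
    schematically mirroring the pipeline's combined MAG summary (Figure 1C)."""
    rows = []
    for mag in sorted(abundances):
        rows.append({
            "mag": mag,
            "depth": abundances[mag],
            "n50": quast.get(mag, {}).get("N50"),
            "completeness": busco.get(mag, {}).get("complete_pct"),
            "contamination": busco.get(mag, {}).get("duplicated_pct"),
            # MAGs below the quality thresholds get no GTDB-Tk call
            "taxonomy": gtdbtk.get(mag, "unclassified"),
        })
    return rows

summary = summarize(
    {"MAG1": 12.5},
    {"MAG1": {"N50": 50_000}},
    {"MAG1": {"complete_pct": 95.0, "duplicated_pct": 1.2}},
    {"MAG1": "d__Bacteria;p__Firmicutes"},
)
print(summary[0]["taxonomy"])
```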

Quality assurance

Preprocessed short reads are classified using Kraken2 (25) or Centrifuge (26) and visualized in Krona charts (27) to assess potential contamination and the microbial community before the assembly. MultiQC (28) is used to generate a comprehensive quality report aggregating the QC results across all samples. Table 1 shows a comparison of nf-core/mag’s functionality to other existing pipelines for metagenome assembly and binning. The respective analysis tools used in nf-core/mag were chosen based on benchmarking results from the CAMI challenge (13,29) and based on specific user requests from the scientific community. A more detailed pipeline comparison listing also the individual tools as well as a brief discussion about tool choices can be found in Supplementary Data: Section S1.
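The read classifications feeding the Krona charts come from reports like Kraken2's; a minimal parser for its standard tab-separated layout (percentage, clade reads, direct reads, rank code, NCBI taxid, indented name) could look as follows. This sketch assumes the default 6-column report format.

```python
def parse_kraken2_report(text):
    """Parse a Kraken2-style report into one dict per taxon line.

    Assumes the standard 6 tab-separated columns: percentage of reads,
    clade read count, directly assigned read count, rank code, NCBI taxid,
    and the (indentation-prefixed) taxon name.
    """
    rows = []
    for line in text.strip().splitlines():
        pct, clade, direct, rank, taxid, name = line.split("\t")
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade),
            "direct_reads": int(direct),
            "rank": rank,
            "taxid": int(taxid),
            "name": name.strip(),  # strip indentation encoding the hierarchy
        })
    return rows

report = "90.00\t9000\t0\tD\t2\t  Bacteria\n10.00\t1000\t1000\tU\t0\tunclassified"
rows = parse_kraken2_report(report)
print(rows[0]["name"], rows[0]["percent"])
```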
Table 1.

Comparison of nf-core/mag’s functionality with commonly used metagenome assembly and binning pipelines. A more detailed comparison is shown in Supplementary Table S1

| Category       | Functionality                                                          | Muffin v1.0.3 | ATLAS v2.6a2 | nf-core/mag v2.1.0 |
|----------------|------------------------------------------------------------------------|---------------|--------------|--------------------|
| Assembly       | Hybrid assembly                                                        | Yes           | Partial      | Yes                |
| Assembly       | Reassembly after binning                                               | Yes           | No           | No                 |
| Assembly       | Group-wise co-assembly                                                 | No            | No           | Yes                |
| Genome binning | Group-wise co-abundances used for binning                              | No            | Yes          | Yes                |
| Genome binning | Bin refinement                                                         | Yes           | Yes          | No                 |
| Genome binning | MAG abundance estimation                                               | No            | Yes          | Yes                |
| Annotation     | Taxonomic classification                                               | Yes           | Yes          | Yes                |
| Annotation     | Functional annotation                                                  | Yes           | Yes          | No                 |
| Usability      | Reproducibility                                                        | No            | No           | Yes                |
| Usability      | Adherence to set of strict best-practice guidelines for development    | No            | No           | Yes                |

Exemplary results

We ran nf-core/mag v2.1.0 on the metagenomic time series data simulated with CAMISIM. Figure 1B shows an example heatmap representing MAG abundances across samples, obtained with nf-core/mag performing hybrid, group-wise co-assembly. To illustrate the possible impact of the assembly setting, we compared the results for four different nf-core/mag settings, i.e. (i) short read only, sample-wise assembly, (ii) hybrid, sample-wise assembly, (iii) short read only, group-wise co-assembly and (iv) hybrid, group-wise co-assembly. Figure 2 shows a comparison of the resulting assemblies with respect to commonly used assembly metrics. The results demonstrate that—for this particular time series data—both hybrid assembly as well as group-wise co-assembly increase the assembly's size, its N50 value and the number of reconstructed MAGs, and thus likely the overall assembly completeness. For further details and results see Supplementary Data: Section S5.
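The centered log-ratio (CLR) depths behind the Figure 1B heatmap can be sketched as below; the pseudocount guarding against zero depths is an illustrative choice, not necessarily the value the pipeline uses.

```python
import math

def clr(depths, pseudocount=1e-6):
    """Centered log-ratio transform of a vector of MAG depths: log of each
    value minus the log of the geometric mean, so the transformed vector is
    centered at zero. Illustrative sketch of the transform used for the
    abundance heatmap."""
    shifted = [d + pseudocount for d in depths]
    log_vals = [math.log(v) for v in shifted]
    mean_log = sum(log_vals) / len(log_vals)  # log of the geometric mean
    return [lv - mean_log for lv in log_vals]

vals = clr([1.0, 10.0, 100.0])
print([round(v, 3) for v in vals])
```

Because CLR values are centered, each row of the heatmap sums to (approximately) zero, making abundance patterns comparable across MAGs of very different absolute depths.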
Figure 2.

Assembly metrics obtained using different nf-core/mag assembly settings on the simulated data: sample-wise assembly, group-wise co-assembly, short read only assembly or hybrid assembly. Each point corresponds to one assembly, originating either from one sample or one group. Metrics displayed are (A) total length of the assembly in base pairs, (B) N50 value (i.e. the length of the shortest contig that needs to be included to cover at least 50% of the genome), (C) number of contigs in the final assembly, (D) size of the largest contig in base pairs and (E) number of MAGs identified in the final assembly. The metrics (A)–(D) are part of the QUAST assembly summary.
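The N50 definition used in the figure can be sketched directly:

```python
def n50(contig_lengths):
    """N50: the length of the shortest contig in the smallest set of longest
    contigs that together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:  # covered at least 50% of the assembly
            return length
    return 0

# Total 280 bp; 100 + 80 = 180 >= 140, so N50 is 80.
print(n50([100, 80, 50, 30, 20]))  # 80
```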

DISCUSSION

We implemented nf-core/mag, an easy-to-install, reproducible and portable pipeline for hybrid assembly, binning and taxonomic classification of metagenomes. It facilitates the analysis of large metagenomic datasets, while allowing the generation of results that can be reproduced by other scientists. It provides comprehensive usage documentation (https://nf-co.re/mag) and community support via a dedicated Slack channel (https://nfcore.slack.com/channels/mag). One important advantage over existing pipelines is that it can utilize sample-wise group information to perform co-assembly and/or to compute co-abundances used for the genome binning step. This can be particularly useful for the analysis of enrichment cultures or longitudinal datasets, as often generated in clinical studies, or for studies that are also interested in low-abundance genomes. Users can choose the approach most suitable to their specific research question and experimental setup, or even compare different settings. The pipeline has already been successfully applied in microbial studies (30,31) and, as part of nf-core, will be constantly maintained and developed further to keep up with state-of-the-art analysis methods. We envision that the version of nf-core/mag presented here (v2.1.0) will be further improved, for example, by adding functional annotation as well as assembly and bin refinement steps. The modular DSL2 structure efficiently allows future extensions with new tools and integrations with other (sub-)workflows, for example, for the analysis of metatranscriptomic data. As a community effort focusing on best practices, an ongoing aim is to join forces with other tool and pipeline developers in the field of metagenomics. At the time of writing, for instance, work on a bin refinement step had already been started by members of the nf-core community.

DATA AVAILABILITY

nf-core/mag code is hosted on GitHub under the nf-core organization (https://github.com/nf-core/mag) and released under the MIT license. The metagenomic data simulated with CAMISIM and used to generate the exemplary results are available at https://doi.org/10.5281/zenodo.5155395.
REFERENCES (10 of 30 shown)

1.  hybridSPAdes: an algorithm for hybrid assembly of short and long reads.

Authors:  Dmitry Antipov; Anton Korobeynikov; Jeffrey S McLean; Pavel A Pevzner
Journal:  Bioinformatics       Date:  2015-11-20       Impact factor: 6.937

2.  MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices.

Authors:  Dinghua Li; Ruibang Luo; Chi-Man Liu; Chi-Ming Leung; Hing-Fung Ting; Kunihiko Sadakane; Hiroshi Yamashita; Tak-Wah Lam
Journal:  Methods       Date:  2016-03-21       Impact factor: 3.608

3.  Tutorial: assessing metagenomics software with the CAMI benchmarking toolkit.

Authors:  Fernando Meyer; Till-Robin Lesker; David Koslicki; Adrian Fritz; Alexey Gurevich; Aaron E Darling; Alexander Sczyrba; Andreas Bremges; Alice C McHardy
Journal:  Nat Protoc       Date:  2021-03-01       Impact factor: 13.491

4.  Inclusion of Oxford Nanopore long reads improves all microbial and viral metagenome-assembled genomes from a complex aquifer system.

Authors:  Will A Overholt; Martin Hölzer; Patricia Geesink; Celia Diezel; Manja Marz; Kirsten Küsel
Journal:  Environ Microbiol       Date:  2020-08-20       Impact factor: 5.491

5.  The nf-core framework for community-curated bioinformatics pipelines.

Authors:  Philip A Ewels; Alexander Peltzer; Sven Fillinger; Harshil Patel; Johannes Alneberg; Andreas Wilm; Maxime Ulysse Garcia; Paolo Di Tommaso; Sven Nahnsen
Journal:  Nat Biotechnol       Date:  2020-03       Impact factor: 54.908

6.  MultiQC: summarize analysis results for multiple tools and samples in a single report.

Authors:  Philip Ewels; Måns Magnusson; Sverker Lundin; Max Käller
Journal:  Bioinformatics       Date:  2016-06-16       Impact factor: 6.937

7.  metaSPAdes: a new versatile metagenomic assembler.

Authors:  Sergey Nurk; Dmitry Meleshko; Anton Korobeynikov; Pavel A Pevzner
Journal:  Genome Res       Date:  2017-03-15       Impact factor: 9.043

8.  ATLAS: a Snakemake workflow for assembly, annotation, and genomic binning of metagenome sequence data.

Authors:  Silas Kieser; Joseph Brown; Evgeny M Zdobnov; Mirko Trajkovski; Lee Ann McCue
Journal:  BMC Bioinformatics       Date:  2020-06-22       Impact factor: 3.169

9.  MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies.

Authors:  Dongwan D Kang; Feng Li; Edward Kirton; Ashleigh Thomas; Rob Egan; Hong An; Zhong Wang
Journal:  PeerJ       Date:  2019-07-26       Impact factor: 2.984

10.  BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes.

Authors:  Mosè Manni; Matthew R Berkeley; Mathieu Seppey; Felipe A Simão; Evgeny M Zdobnov
Journal:  Mol Biol Evol       Date:  2021-09-27       Impact factor: 16.240

