Chao Yang1, Debajyoti Chowdhury2,3, Zhenmiao Zhang1, William K Cheung1, Aiping Lu2,3, Zhaoxiang Bian4,5, Lu Zhang1,2.
Abstract
Metagenomic sequencing provides a culture-independent avenue to investigate complex microbial communities by constructing metagenome-assembled genomes (MAGs). A MAG represents a microbial genome as a group of assembled sequences with similar characteristics. MAGs enable us to identify novel species and understand their potential functions in a dynamic ecosystem. Many computational tools have been developed to construct and annotate MAGs from metagenomic sequencing; however, a comprehensive introduction to their background and practical performance is still lacking. In this paper, we thoroughly investigate the computational tools designed for both upstream and downstream analyses, including metagenome assembly, metagenome binning, gene prediction, functional annotation, taxonomic classification, and profiling. We categorize the commonly used tools into groups based on their functional background and introduce the underlying core algorithms and associated information to provide a comparative outlook. Furthermore, we emphasize the computational requirements and offer guidance to help users select the most efficient tools. Finally, we indicate current limitations, potential solutions, and future perspectives for further improving the tools for MAG construction and annotation. We believe that our work provides a consolidated resource for the current stage of MAG studies and sheds light on the future development of more effective MAG analysis tools for metagenomic sequencing.
Keywords: CNN, convolutional neural network; DBG, De Bruijn graph; GTDB, Genome Taxonomy Database; Gene functional annotation; Gene prediction; Genome assembly; HMM, Hidden Markov Model; KEGG, Kyoto Encyclopedia of Genes and Genomes; LCA, lowest common ancestor; LPA, label propagation algorithm; MAGs, metagenome-assembled genomes; Metagenome binning; Metagenome-assembled genomes; Metagenomic sequencing; Microbial abundance profiling; OLC, overlap-layout consensus; ONT, Oxford Nanopore Technologies; ORFs, open reading frames; PacBio, Pacific Biosciences; QC, quality control; SLR, synthetic long reads; TNFs, tetranucleotide frequencies; Taxonomic classification
Year: 2021 PMID: 34900140 PMCID: PMC8640167 DOI: 10.1016/j.csbj.2021.11.028
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1 Schematic representation of the different approaches used in the metagenomic research field. (A) A schematic contrast between culture-independent (metagenomics) methods and culture-dependent methods. The process for generating sequencing data under each of the two strategies is illustrated. (B) A schematic contrast between assembly-based and reference-based approaches to metagenomic sequencing data.
Tools for sequence quality control. For each tool, the sequencing technologies (column 2), the original publications (column 3), the characteristics (column 4), and the websites to download these tools (column 5) are illustrated. The sequence quality control tools and related content are explained in Section 2.1.
| Tools | Technologies | Publications | Characteristics | Websites |
|---|---|---|---|---|
| fastp | short reads, SLR and linked reads | Chen et al. 2018 | ultra-fast; exhaustive functions | |
| FastQC | short reads, SLR and linked reads | | excellent visualization; exhaustive functions | |
| Trimmomatic | short reads, SLR and linked reads | Bolger et al. 2014 | flexible and exhaustive functions | |
| SOAPnuke | short reads, SLR and linked reads | Chen et al. 2018 | reduced memory; predefined modules | |
| SequelTools | long reads | Hufnagel et al. 2020 | user-friendly; exhaustive functions | |
| proovread | long reads | Hackl et al. 2014 | iterative consensus; computationally efficient | |
| NanoPack | long reads | Coster et al. 2018 | exhaustive functions | |
| MinIONQC | long reads | Lanfear et al. 2019 | suitable for large projects involving multiple samples | |
| LongQC | long reads | Fukasawa et al. 2020 | platform-independent, computationally efficient and user-friendly | |
Tools for metagenome assembly. For each assembler, the sequencing technologies (column 2), the original publications (column 3), and summaries of the core algorithms (column 4) and the websites to download these tools (column 5) are illustrated. The assemblers and related algorithms are explained in Section 2.2. DBG: De Bruijn graph; OLC: overlap-layout consensus.
| Tools | Technologies | References | Core algorithms | Websites |
|---|---|---|---|---|
| Omega | short reads | Haider et al. 2014 | OLC | |
| MetaVelvet | short reads | Namiki et al. 2012 | DBG | |
| MetaVelvet-SL | short reads | Afiahayati et al. 2015 | DBG | |
| MetaVelvet-DL | short reads | Liang et al. 2021 | DBG | |
| IDBA-UD | short reads | Peng et al. 2012 | DBG | |
| MEGAHIT | short reads | Li D et al. 2015 | DBG | |
| metaSPAdes | short reads | Nurk et al. 2017 | DBG | |
| Ray Meta | short reads | Boisvert et al. 2012 | DBG | |
| Athena-meta | linked reads | Bishara et al. 2018 | DBG | |
| cloudSPAdes | linked reads | Tolstoganov et al. 2019 | DBG | |
| Nanoscope | SLR | Kuleshov et al. 2016 | DBG | |
| Canu | long reads | Koren et al. 2017 | OLC | |
| NECAT | long reads | Chen et al. 2021 | String Graph | |
| wtdbg2 | long reads | Ruan et al. 2020 | Fuzzy Bruijn Graph | |
| metaFlye | long reads | Kolmogorov et al. 2020 | OLC | |
| DBG2OLC | short and long reads | Ye et al. 2016 | DBG and OLC | |
| OPERA-MS | short and long reads | Bertrand et al. 2019 | DBG | |
| Unicycler | short and long reads | Wick et al. 2017 | DBG | |
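
Most assemblers in the table above list a De Bruijn graph (DBG) as their core algorithm. As a rough illustration of the idea only, the sketch below builds a DBG from a few toy reads and walks a single unambiguous path to recover a contig. The function names, the reads, and the choice of k = 4 are assumptions made for this example; none of the listed assemblers is implemented this way, and real tools additionally handle sequencing errors, reverse complements, uneven coverage, and branching.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while the current node has exactly one distinct successor."""
    contig = start
    node = start
    visited = {start}
    while True:
        successors = set(graph.get(node, ()))
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        nxt = successors.pop()
        if nxt in visited:
            break  # avoid looping forever on a cycle
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Toy, error-free overlapping reads (hypothetical data).
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)  # ATGGCGTGCAA
```

The walk only checks out-degree, which is enough for this toy input; a more careful sketch would also stop at nodes with more than one incoming edge.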
Tools for assembly quality control. For each tool, whether it requires reference genomes (column 2), the original publications (column 3), and the websites to download these tools (column 4) are illustrated. The quality control tools and related descriptions are presented in Section 2.3.
| Tools | Require reference genome | Publications | Websites |
|---|---|---|---|
| MetaQUAST | Yes | Mikheenko et al. 2016 | |
| REAPR | No | Hunt et al. 2013 | |
| VALET | No | Olson et al. 2019 | |
| DeepMAsED | No | Mineeva et al. 2020 | |
| CheckM | No | Parks et al. 2015 | |
Tools for metagenome binning. For each tool, the adopted technologies (column 2), the original publications (column 3), summaries of the core algorithms (column 4), and the websites to download these tools (column 5) are illustrated. The metagenome binning tools and related descriptions are presented in Section 2.4.
| Tools | Technologies | Publications | Core algorithms | Websites |
|---|---|---|---|---|
| GroopM | short reads | Imelfort et al. 2014 | Two-way clustering and Hough partitioning | |
| MaxBin2 | short reads | Wu et al. 2016 | Expectation-maximization | |
| CONCOCT | short reads | Alneberg et al. 2014 | Gaussian Mixture Models | |
| MetaBAT2 | short reads | Kang et al. 2019 | Label propagation | |
| MyCC | short reads | Lin et al. 2016 | Affinity propagation | |
| SolidBin | short reads | Wang et al. 2019 | Spectral clustering | |
| BMC3C | short reads | Yu G et al. 2018 | Ensemble clustering | |
| GraphBin | short reads | Mallawaarachchi et al. 2020 | Label propagation | |
| METAMVGL | short reads | Zhang et al. 2021 | Label propagation | |
| VAMB | short reads | Nissen et al. 2021 | Variational Autoencoders | |
| MAGO | short reads | Murovec et al. 2020 | Ensemble learning | |
| MetaWRAP | short reads | Uritskiy et al. 2018 | Ensemble learning | |
| DAS Tool | short reads | Sieber et al. 2018 | Ensemble learning | |
| ProxiMeta | Hi-C | Press et al. 2017 | Graph-based clustering | |
| bin3C | Hi-C | DeMaere et al. 2019 | Network clustering | |
| HiCBin | Hi-C | Du et al. 2021 | Leiden algorithm | |
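
Tetranucleotide frequencies (TNFs), listed among the keywords, are a composition feature that binning tools commonly compute per contig before clustering. The sketch below is a minimal illustration under stated assumptions: the function names and toy contigs are hypothetical, all 256 tetramers are counted separately, and plain Euclidean distance stands in for whatever similarity a given tool uses. Real binners typically also merge reverse-complement tetramers and combine composition with coverage information.

```python
from itertools import product

# All 256 possible tetramers over the DNA alphabet, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(sequence):
    """Return the normalized tetramer frequency vector (length 256) of a sequence."""
    counts = dict.fromkeys(TETRAMERS, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skip windows containing N or other ambiguous bases
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Hypothetical toy contigs: a and b share composition, c does not.
contig_a = "ACGTACGTACGT"
contig_b = "ACGTACGTACGA"
contig_c = "GGGGGGGGCCCC"

tnf_a = tetranucleotide_frequencies(contig_a)
tnf_b = tetranucleotide_frequencies(contig_b)
tnf_c = tetranucleotide_frequencies(contig_c)

# Contigs with similar composition end up closer together in TNF space.
print(euclidean_distance(tnf_a, tnf_b) < euclidean_distance(tnf_a, tnf_c))
```

Contigs this short give very noisy frequencies; the toy lengths are only there to keep the example readable.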
Tools for gene prediction. For each tool, the method types (column 2), the original publications (column 3), summaries of the core algorithms (column 4) and the websites to download these tools (column 5) are illustrated. The gene prediction tools and related descriptions are presented in Section 3.1.
| Tools | Types | Publications | Core algorithms | Websites |
|---|---|---|---|---|
| MetaGeneMark | model based | Zhu et al. 2010 | Hidden Markov Model | |
| Glimmer-MG | model based | Kelley et al. 2012 | Interpolated Markov Model | |
| FragGeneScan | model based | Delcher et al. 2007 | Hidden Markov Model | |
| Prodigal | model based | Hyatt et al. 2010 | Dynamic Programming | |
| MetaGene | model based | Noguchi et al. 2006 | Dynamic Programming | |
| MetaGeneAnnotator | model based | Noguchi et al. 2008 | Dynamic Programming | |
| Meta-MFDL | machine learning | Zhang et al. 2017 | Deep Neural Network | |
| CNN-MGP | machine learning | Al-Ajlan et al. 2019 | Convolutional Neural Network | |
| Balrog | machine learning | Sommer et al. 2021 | Convolutional Neural Network | |
Tools for gene annotation. For each tool, the method types (column 2), the original publications (column 3), summaries of the core algorithms/programs (column 4) and the websites to download these tools (column 5) are illustrated. The gene annotation tools and related descriptions are presented in Section 3.2.
| Tools | Types | Publications | Core algorithms | Websites |
|---|---|---|---|---|
| eggNOG-mapper | Homology-based | Huerta-Cepas et al. 2017 | Hidden Markov Model | |
| GhostKOALA | Homology-based | Kanehisa et al. 2016 | GHOSTX (seed search method) | |
| MG-RAST | Homology-based | Keegan et al. 2016 | Parallelized BLAT | |
| PANNZER2 | Homology-based | Törönen et al. 2018 | Sansparallel (suffix array neighborhood search) | |
| InterProScan | Motif-based | Quevillon et al. 2005 | Phobius (Hidden Markov Model) | |
| GeConT | Gene context based | Ciria et al. 2004 | Blastp | |
| FunGeCo | Gene context based | Anand et al. 2020 | Hidden Markov Model | |
| FlaGs | Gene context based | Saha et al. 2021 | Jackhmmer (Hidden Markov Model) | |
Tools for MAG taxonomic classification. For each tool, the method types (column 2), the original publications (column 3), summaries of the core algorithms (column 4) and the websites to download these tools (column 5) are illustrated. The detailed description is presented in Section 3.3.
| Tools | Types | Publications | Core algorithms | Websites |
|---|---|---|---|---|
| GTDB-Tk | concatenated protein | Chaumeil et al. 2019 | Likelihood-based phylogenetic inference | |
| ezTree | concatenated protein | Wu et al. 2018 | Maximum likelihood | |
| PhyloPhlAn3 | concatenated protein | Asnicar et al. 2020 | Maximum likelihood | |
| MiGA | genome-based relatedness | Rodriguez-R et al. 2018 | Markov clustering | |
Tools for profiling MAG abundance. For each tool, the method types (column 2), the original publications (column 3), summaries of the core algorithms (column 4) and the websites to download these tools (column 5) are illustrated. The abundance profiling tools and related descriptions are presented in Section 3.4.
| Tools | Types | Publications | Core algorithms | Websites |
|---|---|---|---|---|
| Kaiju | translated protein based | Menzel P et al. 2016 | Backwards search | |
| Kraken | | Wood DE et al. 2014 | Classification tree | |
| Kraken2 | | Wood DE et al. 2019 | Spaced seed | |
| Bracken | | Jennifer Lu et al. 2017 | Bayesian probability algorithm | |
| CLARK | | Ounit R et al. 2015 | Spectral decomposition | |
| k-SLAM | | Ainsworth D et al. 2017 | Pseudo-assembly | |
| MetaPhlAn3 | marker gene based | Beghini F et al. 2021 | Comprehensive pipeline | |
| PanPhlAn3 | marker gene based | Beghini F et al. 2021 | Comprehensive pipeline | |
| IGGsearch | marker gene based | Nayfach S et al. 2019 | Comprehensive pipeline | |
| ConStrain | SNP based | Luo C et al. 2015 | SNP-flow algorithm | |
| StrainFinder | SNP based | Smillie CS et al. 2018 | Expectation-maximization | |
| StrainEst | SNP based | Albanese D et al. 2017 | Penalized optimization | |
| StrainPhlAn3 | SNP based | Beghini F et al. 2021 | Comprehensive pipeline | |