Izaak Coleman, Tal Korem.
Abstract
A central paradigm in microbiome data analysis, which we term the genome-centric paradigm, is that a linear (non-branching) DNA sequence is the ideal representation of a microbial genome. This representation is natural, as microbes indeed have non-branching genomes. Tremendous discoveries in microbiology were made under this paradigm, but is it always optimal for microbiome research? In this Commentary, we claim that the realization of this paradigm in metagenomic assembly, a fundamental step in the "metagenomics analysis pipeline," suboptimally models the extensive genomic variability present in the microbiome. We outline our efforts to address these issues with a "genome-free" approach that eschews linear genomic representations in favor of a pan-metagenomic graph.
Keywords: assembly; genomics; metagenomics; microbiome
Year: 2021 PMID: 34402639 PMCID: PMC8407213 DOI: 10.1128/mSystems.00816-21
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 6.496
FIG 1 (A) Visual comparison between a genome; a metagenome, the collection of all genomes from a sampled microbial community; and a pan-metagenome, a collection of genomes, each deriving from one of multiple sampled communities. (B) The heterogeneity caveat: genomic variation between closely related genomes (dashed sections) induces branching structures in assembly graphs (dashed nodes and edges). Linear assembly breaks down these structures, resulting in either fragmented contigs or the removal of variable regions. (C) The abundance caveat: undersampling of low-abundance genomes creates gaps in their assemblies. Co-assembly attempts to exploit information from close-matching genomes in other samples (red path) to fill these gaps. Some regions from these genomes are identical (diagonally striped nodes) and facilitate co-assembly; others are divergent, and introduce additional branching to the graph. This may result in either chimeras or fragmented contigs, and lower-quality assemblies in general. (D) We propose a graph-based representation of the pan-metagenome that addresses the caveats of the current paradigm. Our representation models metagenomic data across multiple samples, while keeping track of the originating sample of each sequence (red, black, and green). Sequence homology is used to collapse similar genomic regions (overlapping nodes), attenuating excessive branching within the graph in order to reveal variation at different scales with no information loss.
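
The branching described in panel (B), and the per-sample bookkeeping described in panel (D), can be illustrated with a toy "colored" de Bruijn graph. This is only a minimal sketch under stated assumptions, not the authors' pan-metagenomic graph: the function names `build_colored_graph` and `branching_nodes`, the sample labels, the two short sequences, and the choice of k = 4 are all hypothetical, and the homology-based collapsing of similar regions is not modeled.

```python
from collections import defaultdict


def build_colored_graph(samples, k):
    """Build a toy colored de Bruijn graph (illustrative only).

    samples: dict mapping a sample label to a list of DNA sequences.
    Returns a nested dict: (k-1)-mer node -> successor node -> set of
    sample labels in which that edge (k-mer) was observed.
    """
    graph = defaultdict(lambda: defaultdict(set))
    for sample, sequences in samples.items():
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                # Each k-mer is an edge from its prefix to its suffix,
                # labeled with the sample it came from.
                graph[kmer[:-1]][kmer[1:]].add(sample)
    return graph


def branching_nodes(graph):
    """Return nodes with more than one successor, i.e. where paths diverge."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Two hypothetical, closely related sequences differing at one position.
samples = {
    "sample_red": ["ACGTTGCA"],
    "sample_black": ["ACGTAGCA"],
}

graph = build_colored_graph(samples, k=4)
branches = branching_nodes(graph)

for node in branches:
    for successor, colors in sorted(graph[node].items()):
        print(node, "->", successor, sorted(colors))
```

In this toy input, the single-base difference makes one node fork into two sample-specific paths that later rejoin, while the shared flanking edges carry both sample labels. A linear assembler would have to break or discard one side of such a fork; keeping the graph, together with the labels, retains both variants and their sample of origin.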