
expam-high-resolution analysis of metagenomes using distance trees.

Sean M Solari^1,2, Remy B Young^1,2, Vanessa R Marcelino^1,2, Samuel C Forster^1,2.

Abstract

SUMMARY: Shotgun metagenomic sequencing provides the capacity to understand microbial community structure and function at unprecedented resolution; however, current analytical methods are constrained by a focus on taxonomic classifications that may obfuscate functional relationships. Here, we present expam, a tree-based, taxonomy-agnostic tool for the identification of biologically relevant clades from shotgun metagenomic sequencing.
AVAILABILITY AND IMPLEMENTATION: expam is an open-source Python application released under the GNU General Public Licence v3.0. expam installation instructions, source code and tutorials can be found at https://github.com/seansolari/expam.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2022 PMID: 36029242 PMCID: PMC9563691 DOI: 10.1093/bioinformatics/btac591

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931

1 Introduction

Microbial communities perform essential functions in a variety of ecosystems (Danovaro ) including the human body (Lloyd-Price ), where compositional changes have been correlated with diseases from inflammatory bowel disease (Ni ) to cancers (Frankel ) and autoimmune diseases (Brown ). Shotgun metagenomic sequencing now represents the gold standard for rapid assessment of the functional capacity and composition of these microbial communities. In reference-based metagenomic analysis of these datasets (Beghini ; Brady and Salzberg, 2009; LaPierre ; Milanese ; Wood ), shotgun reads are compared against sequence collections to ascertain the taxonomic distribution of species within the community (Forster ; Lloyd-Price ). While taxonomy provides an important standard for describing and comparing microbes, prokaryotic taxonomic groups do not necessarily capture precise genomic relationships (Fraser ). Specifying the resolved hierarchical structure between reference genomes enables clade-specific functional associations, making it possible to understand phenotypic relationships at a resolution that is lost when using taxonomy. Here, we implement this concept in a software tool called expam. expam provides precise phylogenetic profiling of metagenomic data using highly resolved trees, while simultaneously analysing shotgun data for signs of novel biological sequence.

2 Materials and methods

2.1 expam database

Construction of the expam database requires two sources of data: a collection of reference sequences, and a Newick tree specifying their relationship. Optimal classification performance requires accurate, high-resolution trees; while tree specification is left at the user’s discretion, this criterion makes distance-based and phylogenetic trees primary candidates. Like many k-mer-based metagenome profilers, the database consists of a key-value store, with each key being a k-mer from some reference sequence. However, each database value refers to the node within the tree that is the lowest common ancestor (LCA) of all reference sequences containing the corresponding key, rather than to the shared taxonomic ancestor (Fig. 1A). To construct this database, expam uses Python multiprocessing to concurrently extract and sort k-mers (Knuth, 1998; Marcais and Kingsford, 2011) from all reference sequences, before mapping these k-mers to their LCA. The resulting k-mer and LCA NumPy arrays (Harris ) are compressed on disk using the PyTables library, and loaded into shared memory during sample processing for parallel read classification.
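
The k-mer to LCA mapping described above can be illustrated with a minimal Python sketch. This is not expam's implementation: the toy tree, the reference sequences and all function names are hypothetical, a plain dictionary stands in for the sorted NumPy arrays, and reverse complements, multiprocessing and on-disk compression are ignored.

```python
from typing import Dict, Iterator, List, Optional

# Hypothetical toy reference tree, stored as child -> parent (root has no parent).
PARENT: Dict[str, Optional[str]] = {
    "root": None, "A": "root", "B": "root",
    "g1": "A", "g2": "A", "g3": "B",
}

# Hypothetical reference sequences, one per leaf of the tree.
REFERENCES: Dict[str, str] = {
    "g1": "TTACGTA",
    "g2": "TTACGTT",
    "g3": "GGACGTC",
}


def path_to_root(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the nodes from `node` up to and including the root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def lca(a: str, b: str, parent: Dict[str, Optional[str]]) -> str:
    """Lowest common ancestor of two nodes (a node is its own ancestor)."""
    ancestors_of_a = set(path_to_root(a, parent))
    for node in path_to_root(b, parent):
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes do not belong to the same tree")


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer of `seq`."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_database(references: Dict[str, str],
                   parent: Dict[str, Optional[str]],
                   k: int) -> Dict[str, str]:
    """Map each k-mer to the LCA of all reference leaves that contain it."""
    db: Dict[str, str] = {}
    for leaf, seq in references.items():
        for kmer in kmers(seq, k):
            db[kmer] = lca(db[kmer], leaf, parent) if kmer in db else leaf
    return db


DB = build_database(REFERENCES, PARENT, k=4)
print(DB["ACGT"], DB["TTAC"], DB["CGTA"])
```

In this toy example the k-mer shared by all three references maps to the root, the k-mer shared only by `g1` and `g2` maps to their parent `A`, and a k-mer unique to `g1` maps to that leaf.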

Fig. 1.

Overview of the expam pipeline using two synthetic metagenomes. (A) k-mers are extracted from each metagenomic read and mapped against an expam database. (B) The k-mer distribution of this read is analysed and classified within the reference tree (gold stars). (C) Read classifications are accumulated, and the phylogenetic distribution of various samples can be plotted and compared.

2.2 Classification algorithm

Within the highly resolved tree, each read has a k-mer distribution, that is, the set of nodes to which k-mers from this read are mapped. The k-mer distribution of any sequence present in some reference S must lie within the root-to-leaf path terminating at S. Metagenomic reads can therefore present either as single-lineage (SL) reads, or split-lineage reads (hereafter splits), whose k-mers are distributed along one or multiple lineages, respectively. In both cases, reads are assigned to the lowest common node of all lineages (Fig. 1B). However, high split counts in a particular region of the tree suggest the presence of microbial isolates lacking reference genomes in the database. The inclusion of specific reference sequences belonging to these under-represented clades can therefore enable targeted classification improvement. A heuristic α parameter filters low-abundance lineages in the k-mer distribution (Supplementary Equation S1), such as those arising from sequencing error. The default α parameter value is suitable for general use cases. Finally, identified clades from each sample are available as raw counts in standard Kraken output format and can be visualized by expam in the reference phylogenetic tree (Fig. 1C).
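
The distinction between single-lineage reads and splits can be sketched as follows. This is an illustrative reading of the paragraph above, not expam's code: the toy tree and function names are hypothetical, and the α filter shown here (dropping nodes that hold less than a fraction α of the read's k-mers) is an assumed stand-in for Supplementary Equation S1, which is not reproduced in this text. The α values used are arbitrary and are not expam's default.

```python
from collections import Counter
from functools import reduce
from typing import Dict, List, Optional, Tuple

# Hypothetical toy reference tree, stored as child -> parent.
PARENT: Dict[str, Optional[str]] = {
    "root": None, "A": "root", "B": "root",
    "g1": "A", "g2": "A", "g3": "B",
}


def path_to_root(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the nodes from `node` up to and including the root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def lca(a: str, b: str, parent: Dict[str, Optional[str]]) -> str:
    """Lowest common ancestor of two nodes (a node is its own ancestor)."""
    ancestors_of_a = set(path_to_root(a, parent))
    return next(n for n in path_to_root(b, parent) if n in ancestors_of_a)


def classify(hits: List[str],
             parent: Dict[str, Optional[str]],
             alpha: float) -> Tuple[Optional[str], str]:
    """Classify one read from the tree nodes its k-mers mapped to.

    ASSUMPTION: nodes holding less than `alpha` of the read's k-mers are
    dropped; the real filter is defined in Supplementary Equation S1.
    """
    counts = Counter(hits)
    total = sum(counts.values())
    if total == 0:
        return None, "unclassified"
    kept = [n for n, c in counts.items() if c / total >= alpha]
    if not kept:
        return None, "unclassified"
    # A lineage tip is a kept node that is not an ancestor of another kept node.
    tips = [n for n in kept
            if not any(m != n and n in path_to_root(m, parent) for m in kept)]
    if len(tips) == 1:
        # All k-mers lie on one root-to-leaf path: single-lineage read.
        return tips[0], "single-lineage"
    # Several lineages: assign the read to the node where they diverge.
    return reduce(lambda a, b: lca(a, b, parent), tips), "split"


print(classify(["root", "A", "g1", "g1"], PARENT, alpha=0.1))
print(classify(["g1"] * 5 + ["g2"] * 4 + ["A"], PARENT, alpha=0.1))
print(classify(["g1"] * 19 + ["g3"], PARENT, alpha=0.1))
```

In the third call, the single k-mer mapping to `g3` falls below the assumed threshold and is discarded, so the read is treated as single-lineage rather than as a split at the root.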

2.3 Converting classifications to NCBI taxonomy

Despite the disadvantages of taxonomy for read classification, it remains a valuable tool for the communication of findings. To obtain a taxonomic summary of tree-based classifications, expam maps each point in the reference tree to the LCA of all taxonomic lineages among reference sequences below this point. These results are output in the same standardized Kraken output format.
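
A minimal sketch of this tree-to-taxonomy mapping is given below, assuming each reference sequence carries a taxonomic lineage written from the highest rank down to species. The tree, the placeholder taxon names and the function names are all hypothetical, and the Kraken-format output step is omitted.

```python
from typing import Dict, List

# Hypothetical toy reference tree, stored as parent -> children.
CHILDREN: Dict[str, List[str]] = {
    "root": ["A", "B"],
    "A": ["g1", "g2"],
    "B": ["g3"],
}

# Hypothetical taxonomic lineages (placeholder names), highest rank first.
LINEAGE: Dict[str, List[str]] = {
    "g1": ["Bacteria", "PhylumX", "GenusX", "GenusX sp1"],
    "g2": ["Bacteria", "PhylumX", "GenusX", "GenusX sp2"],
    "g3": ["Bacteria", "PhylumY", "GenusY", "GenusY sp1"],
}


def leaves_below(node: str, children: Dict[str, List[str]]) -> List[str]:
    """Return all leaves (reference sequences) at or below `node`."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    leaves: List[str] = []
    for kid in kids:
        leaves.extend(leaves_below(kid, children))
    return leaves


def common_prefix(lineages: List[List[str]]) -> List[str]:
    """Longest shared prefix of several lineages, i.e. their taxonomic LCA path."""
    prefix: List[str] = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        prefix.append(ranks[0])
    return prefix


def taxonomic_lca(node: str,
                  children: Dict[str, List[str]],
                  lineage: Dict[str, List[str]]) -> str:
    """Taxon shared by every reference sequence below a tree node."""
    shared = common_prefix([lineage[leaf] for leaf in leaves_below(node, children)])
    return shared[-1] if shared else "root"


for tree_node in ("g1", "A", "B", "root"):
    print(tree_node, "->", taxonomic_lca(tree_node, CHILDREN, LINEAGE))
```

A node whose descendants all share a genus maps to that genus, whereas a node spanning several phyla maps only to the highest shared rank, which illustrates how resolution present in the tree can be lost in the taxonomic summary.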

3 Results

We compared expam’s performance to a collection of widely used metagenomic profilers (Beghini ; Gruber-Vodicka ; Marcelino ; Müller ; Wood ) (Supplementary Table S1) on 140 publicly available simulated metagenomic communities (Parks ), stratified by four distinct classes: either low or high species diversity, and single or multiple strains (Supplementary Table S2). To standardize classifier performance, the RefSeq collection (release 203) of genome sequences was used as a reference for all software with the capacity to build a custom database, with default databases being used for phyloFlash and MetaPhlAn3. Read-level analysis of classifier performance was used to determine the assignment accuracy of each read, and taxonomic summaries assessed the total set of taxa estimated to be in the sample (Supplementary Methods). Our results demonstrate that expam achieves stringent taxonomic and read-level species precisions of 84.0% and 63.9%, respectively, when averaged across the 140 samples (Supplementary Figs S1, S2, and Table S3), exhibiting a robustness to spurious read classifications (Anyasi ) that contrasted with the results of Kraken2 (read-level 74.1%; taxonomic 4.1%) and MetaCache (read-level 86.9%; taxonomic 11.1%). Of all tools using the standardized database, expam achieves the highest average species-level taxonomic F1 score of 0.575, with the next highest score of 0.211 achieved by CCMetagen (Supplementary Figs S3 and S4). Notably, expam achieved an average taxonomic recall of 55.8%, a 23% relative decrease from the top recall score (Kraken2, 72.2%) (Supplementary Figs S5 and S6); however, expam’s taxonomic recall generally depends on the degree to which the reference tree and NCBI taxonomy align. To gauge the sensitivity of runtime statistics to k-mer length and number of reference genomes, six expam databases were built with varying numbers of reference sequences and k-mer lengths (Supplementary Tables S4–S7) before being tested against simulated metagenomes.
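
For readers unfamiliar with the metrics above, the following sketch computes taxonomic precision, recall and F1 for one sample, assuming the standard set-based definitions; the exact definitions used in the study are given in the Supplementary Methods and may differ. The taxon names and the function name are hypothetical. The last lines check the arithmetic behind the relative recall decrease quoted in the text.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[str], truth: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 for the taxa reported in one sample."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical sample: five species truly present, four reported by a profiler.
truth = {"sp1", "sp2", "sp3", "sp4", "sp5"}
predicted = {"sp1", "sp2", "sp3", "sp9"}
precision, recall, f1 = precision_recall_f1(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.3f}")

# Relative decrease of expam's average recall (55.8%) from Kraken2's (72.2%).
relative_decrease = (72.2 - 55.8) / 72.2
print(f"relative recall decrease: {relative_decrease:.1%}")
```

Because the reported F1 score is averaged over per-sample values, it need not equal the F1 computed from the averaged precision and recall.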
While precision and recall increased with the number of references, build and classification memory also increased with the amount of reference sequence (Supplementary Fig. S7). Classification time and memory usage were relatively stable for larger k, being determined predominantly by the number of references (Supplementary Tables S4–S7); however, a large k-mer length relative to the number of reference genomes hinders recall (Supplementary Fig. S8). A pre-built expam database is made publicly available to overcome the comparatively large computational resources required for database indexing (see Data Availability).

The distance tree-based method employed by expam achieves a resolution that matches existing approaches when translated into the taxonomic space, while extending the discriminative power of metagenomic analysis to taxonomy-agnostic isolate and clade analysis. This approach provides the ability for targeted analysis, including high-resolution assessment and correction of database coverage and clade-specific functional analysis.

19 in total

Review 1. The bacterial species challenge: making sense of genetic and ecological diversity.

Authors: Christophe Fraser; Eric J Alm; Martin F Polz; Brian G Spratt; William P Hanage
Journal: Science Date: 2009-02-06 Impact factor: 47.728

Review 2. Gut microbiota and IBD: causation or correlation?

Authors: Josephine Ni; Gary D Wu; Lindsey Albenberg; Vesselin T Tomov
Journal: Nat Rev Gastroenterol Hepatol Date: 2017-07-19 Impact factor: 46.802

3. MetaCache: context-aware classification of metagenomic reads using minhashing.

Authors: André Müller; Christian Hundt; Andreas Hildebrandt; Thomas Hankeln; Bertil Schmidt
Journal: Bioinformatics Date: 2017-12-01 Impact factor: 6.937

4. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3.

Authors: Francesco Beghini; Lauren J McIver; Aitor Blanco-Míguez; Leonard Dubois; Francesco Asnicar; Sagun Maharjan; Ana Mailyan; Paolo Manghi; Matthias Scholz; Andrew Maltez Thomas; Mireia Valles-Colomer; George Weingart; Yancong Zhang; Moreno Zolfo; Curtis Huttenhower; Eric A Franzosa; Nicola Segata
Journal: Elife Date: 2021-05-04 Impact factor: 8.140

5. Gut microbiome metagenomics analysis suggests a functional model for the development of autoimmunity for type 1 diabetes.

Authors: Christopher T Brown; Austin G Davis-Richardson; Adriana Giongo; Kelsey A Gano; David B Crabb; Nabanita Mukherjee; George Casella; Jennifer C Drew; Jorma Ilonen; Mikael Knip; Heikki Hyöty; Riitta Veijola; Tuula Simell; Olli Simell; Josef Neu; Clive H Wasserfall; Desmond Schatz; Mark A Atkinson; Eric W Triplett
Journal: PLoS One Date: 2011-10-17 Impact factor: 3.240

6. Microbial abundance, activity and population genomic profiling with mOTUs2.

Authors: Alessio Milanese; Daniel R Mende; Lucas Paoli; Guillem Salazar; Hans-Joachim Ruscheweyh; Miguelangel Cuenca; Pascal Hingamp; Renato Alves; Paul I Costea; Luis Pedro Coelho; Thomas S B Schmidt; Alexandre Almeida; Alex L Mitchell; Robert D Finn; Jaime Huerta-Cepas; Peer Bork; Georg Zeller; Shinichi Sunagawa
Journal: Nat Commun Date: 2019-03-04 Impact factor: 14.919

7. CCMetagen: comprehensive and accurate identification of eukaryotes and prokaryotes in metagenomic data.

Authors: Vanessa R Marcelino; Philip T L C Clausen; Jan P Buchmann; Michelle Wille; Jonathan R Iredell; Wieland Meyer; Ole Lund; Tania C Sorrell; Edward C Holmes
Journal: Genome Biol Date: 2020-04-28 Impact factor: 13.583

Review 8. Array programming with NumPy.

Authors: Charles R Harris; K Jarrod Millman; Stéfan J van der Walt; Ralf Gommers; Pauli Virtanen; David Cournapeau; Eric Wieser; Julian Taylor; Sebastian Berg; Nathaniel J Smith; Robert Kern; Matti Picus; Stephan Hoyer; Marten H van Kerkwijk; Matthew Brett; Allan Haldane; Jaime Fernández Del Río; Mark Wiebe; Pearu Peterson; Pierre Gérard-Marchant; Kevin Sheppard; Tyler Reddy; Warren Weckesser; Hameer Abbasi; Christoph Gohlke; Travis E Oliphant
Journal: Nature Date: 2020-09-16 Impact factor: 49.962

9. Metalign: efficient alignment-based metagenomic profiling via containment min hash.

Authors: Nathan LaPierre; Mohammed Alser; Eleazar Eskin; David Koslicki; Serghei Mangul
Journal: Genome Biol Date: 2020-09-10 Impact factor: 13.583