metaSNV v2: detection of SNVs and subspecies in prokaryotic metagenomes.

Thea Van Rossum¹, Paul I Costea¹, Lucas Paoli², Renato Alves¹, Roman Thielemann¹, Shinichi Sunagawa², Peer Bork¹,³,⁴.

Abstract

SUMMARY: Taxonomic analysis of microbial communities is well supported at the level of species and strains. However, species can contain significant phenotypic diversity and strains are rarely widely shared across global populations. Stratifying the diversity between species and strains can identify "subspecies", which are a useful intermediary. High-throughput identification and profiling of subspecies is not yet supported in the microbiome field. Here, we use an operational definition of subspecies based on SNV patterns within species to identify and profile subspecies in metagenomes, along with their distinctive SNVs and genes. We incorporate this method into metaSNV v2, which extends existing SNV-calling software to support further SNV interpretation for population genetics. These new features support microbiome analyses to link SNV profiles with host phenotype or environment and niche-specificity. We demonstrate subspecies identification in marine and faecal metagenomes. In the latter, we analyse 70 species in 7,524 adult and infant subjects, supporting a common subspecies population structure in the human gut microbiome and illustrating some limits in subspecies calling.
AVAILABILITY AND IMPLEMENTATION: Source code, documentation, tutorials, and test data are available at https://github.com/metasnv-tool/metaSNV and https://metasnv.embl.de.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2021 PMID: 34791031 PMCID: PMC8796361 DOI: 10.1093/bioinformatics/btab789

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Metagenomic single nucleotide variant (SNV) calling has proven useful in many contexts, such as tracking strains between habitats (Schmidt ) and identifying subspecies in the human microbiome (Costea ). Subspecies are a useful taxonomic resolution because they often have distinct habitats and/or functional traits (Monroe, 1982; Patten, 2015; Van Rossum ). For example, Bifidobacterium longum has subspecies associated with infants (subsp. infantis) and nonhuman animal hosts (subsp. suis) (O’Callaghan ), and subspecies (‘phylotypes’) in Escherichia coli are associated with differences in habitat, antibiotic resistance and pathogenicity (Bailey ). While species-specific microbiome approaches exist (Karcher ; Milani ), no software yet exists to broadly delineate subspecies from metagenomic data. Many tools characterize population-level diversity [MIDAS (Nayfach ), metaSNV v1 (Costea ), POPGENOM (Sjöqvist ), inStrain (Olm et al., 2021), StrainPhlAn (Truong )] and/or recover haplotypes or strains from metagenomes [DESMAN (Quince ), ConStrains (Luo ), inStrain (Olm ), strainGEMS (Tan )], with varying definitions of strains (Van Rossum ). However, none of these tools provides robust clustering of population diversity for data-driven identification of subspecies. Here, we present metaSNV v2, which supports detection of SNVs and population genetic analysis, including population subspecies identification and profiling.

2 Tool description

metaSNV v2 builds on metaSNV v1 (Costea ), which identifies SNVs using SAMtools mpileup (Li, 2011) and mappings of short read metagenomic data against species-specific genomic references (BAM files). Dissimilarities between metagenomes based on SNV profiles are then calculated for each species. In metaSNV v2, the SNV postprocessing is improved and extended in various ways. For example, a subspecies module has been added (detailed below), SNV filtering has been parallelized and estimates of purifying selection have been added. metaSNV v2 now natively supports a larger database of high quality reference genomes [based on ProGenomes2 (Mende )]. Comprehensive user documentation has been newly developed, including a detailed description of the method (see GitHub repository).

The most significant functionality added to metaSNV v2 is the ‘subpopr module’. In short, this module detects ‘population subspecies’ (Van Rossum ) by calculating SNV-based dissimilarities between a species’ populations across metagenomic samples and assessing whether these populations form distinct clusters. Subspecies detection is performed for each species in the reference database, can be run in parallel, and follows the steps described below using the output of metaSNV v2’s SNV calling and filtering.

For each species, a ‘discovery subset’ of metagenomes is selected wherein the species is abundant and its population likely contains a single subspecies. The latter criterion is satisfied if a metagenome contains minimal internal allele variation relative to the SNV variation across all sampled metagenomes (e.g. at least 80% of the dataset-wide species SNVs have the same allele in over 90% of reads in a metagenome). This criterion has been used previously (Costea ) and has since been conceptualized as ‘quasi-phaseability’ (Garud ). The default parameters target subspecies but can be altered to detect subpopulations defined in a more stringent or lenient way (i.e. with varying levels of diversity between and within them) (discussed in Supplementary Information S1). If no metagenomes meet the discovery subset criteria for a species, then subspecies cannot be detected for that species.

This discovery subset of metagenomes is then tested for robust clustering into subspecies based on their SNV-profile dissimilarities. Clustering confidence is assessed using repeated subsampling and the Prediction Strength algorithm (Tibshirani and Walther, 2005), which yield confidence scores for both the number of clusters (subspecies) and their compositions. Distinctive genotyping SNV alleles are then identified per subspecies. These alleles can be used to estimate the relative abundance of each subspecies in any metagenome, including from later independent studies. Subspecies-specific genes are detected by testing for correlations in abundance between genes and subspecies across metagenomes. All results are summarized in plain text and html reports, with embedded plots and statistical test results.
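As an illustration of the single-subspecies criterion used to select the discovery subset (at least 80% of SNV positions with one allele in over 90% of reads), the following is a minimal Python sketch. It is not metaSNV v2's implementation: the function names, the input layout (one allele frequency per SNV position, `None` where a position is not covered) and the assumption of biallelic SNVs are all hypothetical, and the separate abundance requirement is omitted.

```python
from typing import Dict, List, Optional

# Hypothetical input layout (not metaSNV's file format): for each sample, one
# entry per dataset-wide SNV position giving the fraction of reads (0..1) that
# carry the non-reference allele, or None if the position is not covered.
AlleleFreqs = List[Optional[float]]


def is_single_subspecies(
    freqs: AlleleFreqs,
    min_major_freq: float = 0.9,       # "over 90% of reads"
    min_fraction_fixed: float = 0.8,   # "at least 80% of SNVs"
) -> bool:
    """Return True if the sample shows minimal within-sample allele variation."""
    covered = [f for f in freqs if f is not None]
    if not covered:
        return False  # nothing to evaluate, so the sample cannot qualify
    # Assuming biallelic SNVs, the major allele frequency is max(f, 1 - f).
    near_fixed = sum(1 for f in covered if max(f, 1.0 - f) > min_major_freq)
    return near_fixed / len(covered) >= min_fraction_fixed


def discovery_subset(samples: Dict[str, AlleleFreqs]) -> List[str]:
    """Names of samples passing the criterion, in input order."""
    return [name for name, freqs in samples.items() if is_single_subspecies(freqs)]


if __name__ == "__main__":
    example = {
        "sample_A": [0.95, 0.02, 0.99, 0.97, 1.0],  # every position near-fixed
        "sample_B": [0.50, 0.60, 0.95, 0.40, None],  # mostly intermediate
        "sample_C": [None, None],                    # species not covered
    }
    print(discovery_subset(example))
```

Loosening or tightening the two thresholds corresponds to the more lenient or more stringent subpopulation definitions mentioned above.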

3 Results

To demonstrate the functionality of metaSNV v2, we analyzed 7523 human fecal metagenomes from adults and infants from 27 countries for 70 prevalent and abundant gut species, of which 42 stratified into multiple subspecies (Supplementary Information S1). To compare this to a previous subspecies estimate (Costea ), a ‘reduced’ analysis of 1663 adult-only, geographically limited metagenomes was run and 81% (44/54) agreement in subspecies presence was observed with the previous study (Supplementary Information S1). In the ‘reduced’ and the full (N = 7523) datasets, 83% (44/53) of species had the same number or lack of subspecies, illustrating the dependence on the input set of metagenomes, as in many habitats bacterial populations are still insufficiently covered. For example, subspecies were not detected for B. longum in the adult-only dataset, but its subspecies associated with infants [ssp. infantis (Sela )] was detected in the analysis which included infants.

As a further usage demonstration, metaSNV v2 was run on 288 marine metagenomes (Sunagawa ). Out of 10 species with sufficient prevalence and abundance, subspecies were found for two species, each with distinctive geographic enrichment (Supplementary Information S2).

The core SNV-calling code has not been altered in metaSNV v2 and the resource usage statistics and software comparison from the previous release still apply (Costea ). Many tools delineate strains and/or build within-species phylogenetic trees from metagenomic data [listed above and reviewed in Van Rossum ]. These tools have different concepts of strains, yet all have a fundamentally different resolution than that of subspecies. This difference precludes a direct comparison to these tools for subspecies classifications. To validate the results from subspecies calling in metaSNV v2, we used simulated and real data from species with known subspecies.

A metaSNV v2 analysis of simulated metagenomes (N = 543) composed from mixtures of 540 E. coli genomes representing 11 phylogroups (Waters ; analogous to subspecies) accurately recovered nine subspecies, with the remaining two closely related phylogroups merged into one subspecies (Supplementary Information S3). After pooling these two phylogroups, genomes were accurately classified to their expected phylogroups in 98% of cases, with all misclassifications to a closely related phylogroup. Though subspecies classifications are not produced by other tools, metaSNV v2 and StrainPhlAn both output SNV-based metagenome similarities, which were well correlated for these simulated metagenomes (Spearman R > 0.8, P < 2.2e−16, Supplementary Information S3). In silico E. coli and B. longum metagenomes were used to validate the subspecies identified from the metaSNV v2 analysis of 7523 fecal metagenomes. Results were as expected for 99% of E. coli metagenomes (318/321) and all B. longum metagenomes (12/12). Further, as expected, the subspecies corresponding to B. longum subsp. infantis was almost exclusively seen in infants (61/62) and contained a marker gene for the subspecies, sialidase (Blanco ; Supplementary Information S3).

4 Conclusions

metaSNV v2 features a number of technical improvements over its predecessor in performing within-species SNV calling on metagenomic samples and expanded functionality for SNV-based population genetic analyses, including subspecies and respective differential gene content detection. This supports comparisons of samples at a taxonomic resolution that is derived from the structure of the data itself, enabling hypothesis generation for phenotypic associations and niche adaptation.

References (10 of 22 listed)

1. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data.

Authors: Heng Li
Journal: Bioinformatics Date: 2011-09-08 Impact factor: 6.937

2. Diversity within species: interpreting strains in microbiomes. (Review)

Authors: Thea Van Rossum; Pamela Ferretti; Oleksandr M Maistrenko; Peer Bork
Journal: Nat Rev Microbiol Date: 2020-06-04 Impact factor: 60.633

3. Evaluation of bifidobacterial community composition in the human gut by means of a targeted amplicon sequencing (ITS) protocol.

Authors: Christian Milani; Gabriele A Lugli; Francesca Turroni; Leonardo Mancabelli; Sabrina Duranti; Alice Viappiani; Marta Mangifesta; Nicola Segata; Douwe van Sinderen; Marco Ventura
Journal: FEMS Microbiol Ecol Date: 2014-09-08 Impact factor: 4.194

4. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains.

Authors: Matthew R Olm; Alexander Crits-Christoph; Keith Bouma-Gregson; Brian A Firek; Michael J Morowitz; Jillian F Banfield
Journal: Nat Biotechnol Date: 2021-01-18 Impact factor: 68.164

5. Microbial strain-level population structure and genetic diversity from metagenomes.

Authors: Duy Tin Truong; Adrian Tett; Edoardo Pasolli; Curtis Huttenhower; Nicola Segata
Journal: Genome Res Date: 2017-02-06 Impact factor: 9.043

6. Evolutionary dynamics of bacteria in the gut microbiome within and across hosts.

Authors: Nandita R Garud; Benjamin H Good; Oskar Hallatschek; Katherine S Pollard
Journal: PLoS Biol Date: 2019-01-23 Impact factor: 8.029

7. proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes.

Authors: Daniel R Mende; Ivica Letunic; Oleksandr M Maistrenko; Thomas S B Schmidt; Alessio Milanese; Lucas Paoli; Ana Hernández-Plaza; Askarbek N Orakov; Sofia K Forslund; Shinichi Sunagawa; Georg Zeller; Jaime Huerta-Cepas; Luis Pedro Coelho; Peer Bork
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

8. ConStrains identifies microbial strains in metagenomic datasets.

Authors: Chengwei Luo; Rob Knight; Heli Siljander; Mikael Knip; Ramnik J Xavier; Dirk Gevers
Journal: Nat Biotechnol Date: 2015-09-07 Impact factor: 54.908

9. Revisiting the Metabolic Capabilities of Bifidobacterium longum susbp. longum and Bifidobacterium longum subsp. infantis from a Glycoside Hydrolase Perspective.

Authors: Guillermo Blanco; Lorena Ruiz; Hector Tamés; Patricia Ruas-Madiedo; Florentino Fdez-Riverola; Borja Sánchez; Anália Lourenço; Abelardo Margolles
Journal: Microorganisms Date: 2020-05-13

10. Easy phylotyping of Escherichia coli via the EzClermont web app and command-line tool.

Authors: Nicholas R Waters; Florence Abram; Fiona Brennan; Ashleigh Holmes; Leighton Pritchard
Journal: Access Microbiol Date: 2020-06-19
