Paul Igor Costea, Robin Munch, Luis Pedro Coelho, Lucas Paoli, Shinichi Sunagawa, Peer Bork.
Abstract
We present metaSNV, a tool for single nucleotide variant (SNV) analysis in metagenomic samples, capable of comparing populations of thousands of bacterial and archaeal species. The tool uses as input nucleotide sequence alignments to reference genomes in standard SAM/BAM format, performs SNV calling for individual samples and across the whole data set, and generates various statistics for individual species including allele frequencies and nucleotide diversity per sample as well as distances and fixation indices across samples. Using published data from 676 metagenomic samples of different sites in the oral cavity, we show that the results of metaSNV are comparable to those of MIDAS, an alternative implementation for metagenomic SNV analysis, while data processing is faster and has a smaller storage footprint. Moreover, we implement a set of distance measures that allow the comparison of genomic variation across metagenomic samples and delineate sample-specific variants to enable the tracking of specific strain populations over time. The implementation of metaSNV is available at: http://metasnv.embl.de/.
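The per-sample allele frequencies and across-sample distances mentioned in the abstract can be illustrated with a minimal sketch. This is not metaSNV's implementation: the input layout (position mapped to variant-read count and total depth), the function names, and the choice of a mean absolute difference as the distance are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical layout: genome position -> (reads supporting the variant, total reads).
Counts = Dict[int, Tuple[int, int]]


def allele_frequencies(counts: Counts, min_coverage: int = 1) -> Dict[int, float]:
    """Return position -> variant allele frequency, skipping low-coverage positions."""
    freqs = {}
    for pos, (alt_reads, depth) in counts.items():
        if depth >= min_coverage:
            freqs[pos] = alt_reads / depth
    return freqs


def mean_abs_distance(freq_a: Dict[int, float], freq_b: Dict[int, float]) -> Optional[float]:
    """Mean absolute allele-frequency difference over positions present in both samples."""
    shared = freq_a.keys() & freq_b.keys()
    if not shared:
        return None  # no common positions, distance undefined
    return sum(abs(freq_a[p] - freq_b[p]) for p in shared) / len(shared)
```

Restricting the comparison to positions covered in both samples avoids treating missing coverage as a frequency of zero, which would inflate distances between unevenly sequenced samples.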
Year: 2017 PMID: 28753663 PMCID: PMC5533426 DOI: 10.1371/journal.pone.0182392
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Overview of analysis pipeline and example results.
(A) shows the SNV calling and analysis workflow: an optional pre-processing step that splits the computational load into subsets of similar size based on genome coverage, the main SNV calling step, and post-processing of the raw output, which can be tailored to the aim of the analysis. (B) shows the Principal Coordinate Analysis projection of pairwise distances between oral samples, based on population SNVs, which clearly separates strain populations in tongue dorsum samples from those in supra-gingival plaque samples. (C) shows the tracking of individual SNV frequencies within one individual over a period of 384 days. Each line represents one variant position, and its colour encodes the amount by which the allele frequency at that position changed over time: red marks stable variants that maintain their frequency, while blue marks positions whose frequency in the population changes dramatically. Only a small number of positions vary over the measured period, with most remaining at approximately the same population frequency, suggesting great stability of strain populations within the individual.
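The idea behind panel (C), separating positions that keep their frequency from those that shift, can be sketched as follows. The function names, the use of the max-minus-min range as the measure of change, and the threshold of 0.2 are illustrative assumptions, not values taken from the paper.

```python
from typing import Dict, List, Tuple


def frequency_range(series: List[float]) -> float:
    """Largest change in allele frequency observed across the time points."""
    return max(series) - min(series)


def split_by_stability(
    tracks: Dict[int, List[float]], threshold: float = 0.2
) -> Tuple[List[int], List[int]]:
    """Split positions into (stable, shifting) by their frequency range.

    `tracks` maps a variant position to its allele frequency at each time point.
    `threshold` is a hypothetical cut-off chosen only for this example.
    """
    stable, shifting = [], []
    for pos in sorted(tracks):
        if frequency_range(tracks[pos]) <= threshold:
            stable.append(pos)
        else:
            shifting.append(pos)
    return stable, shifting
```

In a plot like the one described, the range returned by `frequency_range` could drive the colour scale directly instead of being cut at a threshold.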
Fig 2. Comparison of metaSNV and MIDAS results.
Correlation coefficient (R², Mantel) for the pairwise distance matrices generated by MIDAS and metaSNV (top). Only sample intersects for species examined with both methods are compared. Jaccard indices for the sample overlap per species were computed (bottom). The average sample number and the average Jaccard index over all sample intersects are shown in the legend.
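The two quantities in this figure can be sketched in a few lines. The code below is an illustration under assumptions: distance matrices are represented as dictionaries keyed by sorted sample pairs, the function names are hypothetical, and only the correlation statistic is computed. A full Mantel test would additionally permute sample labels to assess significance.

```python
from math import sqrt
from typing import Dict, Iterable, Optional, Tuple

Pair = Tuple[str, str]  # (sample_i, sample_j) with sample_i < sample_j


def jaccard(samples_a: Iterable[str], samples_b: Iterable[str]) -> float:
    """Jaccard index of two sample sets: |intersection| / |union|."""
    a, b = set(samples_a), set(samples_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matrix_correlation(dist_a: Dict[Pair, float], dist_b: Dict[Pair, float]) -> Optional[float]:
    """Pearson correlation of pairwise distances over sample pairs present in both matrices."""
    shared = sorted(dist_a.keys() & dist_b.keys())
    if len(shared) < 2:
        return None
    xs = [dist_a[p] for p in shared]
    ys = [dist_b[p] for p in shared]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation undefined for constant distances
    return cov / sqrt(var_x * var_y)
```

Squaring the value returned by `matrix_correlation` gives an R² comparable in spirit to the one reported per species.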
Resource comparison for metaSNV and MIDAS.
| Step | Jobs | CPU/Job | CPUs | Max RAM | Time/Job | Time | Disk | CPU Time | Total RAM |
|---|---|---|---|---|---|---|---|---|---|
| Alignment | 80 | 32 | 2560 | 51.00 | 100 | 8000 | | 256000 | 2484 |
| Species | 80 | 1 | 80 | 0.02 | 20 | 1600 | | 1600 | 1.7 |
| SNVs | 12 | 2 | 24 | 0.16 | 100 | 1200 | | 2400 | 1.9 |
| Filter/merge | 12 | 1 | 12 | 0.02 | 5 | 60 | | 60 | 0.2 |
| Post | 1 | 1 | 1 | | 1 | 1 | | 1 | 0 |
| Total | 185 | 37 | 2677 | 51.20 | 226 | 10861 | 241 | 260061 | 2488 |
| Species | 80 | 32 | 2560 | 7.80 | 50 | 4000 | | 128000 | 302 |
| SNVs | 80 | 32 | 2560 | 2.50 | 310 | 24800 | | 793600 | 626 |
| Filter/merge | 1 | 1 | 1 | 1.60 | 3094 | 3094 | | 3094 | 0.9 |
| Post | 1 | 1 | 1 | | 1 | 1 | | 1 | 0 |
| Total | 162 | 66 | 5122 | 11.90 | 3455 | 31895 | 537 | 924695 | 930 |
| MIDAS/MetaSNV (from alignment) | | | | | | | | | |
| MIDAS/MetaSNV (from BAM) | | | | | | | | | |
Breakdown of resource usage, including number of jobs, number of CPUs, maximum and average RAM usage, CPU time and storage footprint (this number does not include the original fasta files used in the analysis).
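The derived columns of the table appear to follow simple arithmetic: CPUs = Jobs × CPU/Job, Time = Jobs × Time/Job, and CPU Time = CPUs × Time/Job. The sketch below checks this reading against the first block of rows (the one that includes the Alignment step). The relationships are inferred from the numbers rather than stated in the source, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    jobs: int
    cpus_per_job: int
    time_per_job: float  # same (unstated) time unit as in the table

    @property
    def cpus(self) -> int:
        return self.jobs * self.cpus_per_job

    @property
    def time(self) -> float:
        return self.jobs * self.time_per_job

    @property
    def cpu_time(self) -> float:
        return self.cpus * self.time_per_job


# Jobs, CPU/Job and Time/Job copied from the first block of the table.
steps = [
    Step("Alignment", 80, 32, 100),
    Step("Species", 80, 1, 20),
    Step("SNVs", 12, 2, 100),
    Step("Filter/merge", 12, 1, 5),
    Step("Post", 1, 1, 1),
]

total_cpus = sum(s.cpus for s in steps)
total_time = sum(s.time for s in steps)
total_cpu_time = sum(s.cpu_time for s in steps)
```

Under these assumptions the sums reproduce the block's total row (2677 CPUs, 10861 for Time, 260061 for CPU Time).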