| Literature DB >> 34040632 |
Sergio Andreu-Sánchez1,2, Lianmin Chen1,2, Daoming Wang1, Hannah E Augustijn1, Alexandra Zhernakova1, Jingyuan Fu1,2.
Abstract
Microbes live in complex communities that are of major importance for environmental ecology, public health, and animal physiology and pathology. Short-read metagenomic shotgun sequencing is currently the state-of-the-art technique for exploring these communities. With the aid of metagenomics, our understanding of the microbiome is moving from composition toward functionality, even down to the genetic variant level. While the exploration of single-nucleotide variation in a genome is a standard procedure in genomics, and many sophisticated tools exist to perform this task, identification of genetic variation in metagenomes remains challenging. Major factors that hamper the widespread application of variant-calling analysis include low-depth sequencing of individual genomes (which is especially significant for the microorganisms present in low abundance), the existence of large genomic variation even within the same species, the absence of comprehensive reference genomes, and the noise introduced by next-generation sequencing errors. Some bioinformatics tools, such as metaSNV or InStrain, have been created to identify genetic variants in metagenomes, but the performance of these tools has not been systematically assessed or compared with the variant callers commonly used on single or pooled genomes. In this study, we benchmark seven bioinformatic tools for genetic variant calling in metagenomics data and assess their performance. To do so, we simulated metagenomic reads to mimic human microbial composition, sequencing errors, and genetic variability. We also simulated different conditions, including low and high depth of coverage and unique or multiple strains per species. Our analysis of the simulated data shows that probabilistic method-based tools such as HaplotypeCaller and Mutect2 from the GATK toolset show the best performance. By applying these tools to longitudinal gut microbiome data from the Human Microbiome Project, we show that the genetic similarity between longitudinal samples from the same individuals is significantly greater than the similarity between samples from different individuals. Our benchmark shows that probabilistic tools can be used to call metagenomes, and we recommend the use of GATK's tools as reliable variant callers for metagenomic samples.Entities:
Keywords: benchmark; metagenomics; short-reads; shotgun sequencing; variant-calling
Year: 2021 PMID: 34040632 PMCID: PMC8141913 DOI: 10.3389/fgene.2021.648229
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Summary of tools benchmarked and used for different analyses.
| Tool name | Probabilistic | Pool population | Joint calling | Minimal coverage | ROC curve | Real data |
| BCFtools | Yes | No | Yes | No | Yes | No |
| freebayes | Yes | Yes | Yes | No | Yes | No |
| HaplotypeCaller | Yes | No | Yes | No | Yes | Yes |
| Mutect2 | Yes | Yes | Yes | No | No | Yes |
| VarScan2 | No | Yes | No | 8 | No | No |
| metaSNV | No | Yes | Yes | 4 | No | No |
| InStrain | No | Yes | No | 5 | No | No |
FIGURE 1Representation of the pipeline used. Mutations are introduced into each reference genome before simulation. Simulated reads are trimmed, and human contamination removed. Reference genomes are indexed and mapped using simulated cleaned reads. Alignments are further cleaned before variant calling. All variant outputs are converted to the same format. Introduced variants covered by simulated reads are used as a set of true positives. Variants are checked in the formatted variant calls, and receiver operator characteristic (ROC) curves are calculated by repeating this process at different quality thresholds. All statistics are combined in a single file.
FIGURE 2Single strain variant-calling statistics of seven different tools. Colors indicate different tools. (A) Sensitivity (TP/TP + FN) of each tool. Tukey box plot presents the distribution of precision. Dots show precision per individual bacteria. (B) Precision (TP/TP + FP) of each tool. Tukey’s box plot presents the distribution of precision. Dots show precision per individual bacteria. Distribution of (C) TP, (D) FN, and (E) FP per tool as Tukey box plots. Individual dots indicate bacteria over 1.5 times the interquartile distance. (F) Precision vs. sensitivity plot. Dots indicate mean values among all bacteria. Error bars represent the standard deviation from the mean. (G,H) ROC curve of probabilistic methods. X-axis represents the quantile Phred filter. (G) Mean sensitivity changes with changes in Phred score. Line shows the mean value among bacteria. Shading represents the standard deviation from the mean. (H) Mean precision changes with Phred score changes. Line represents mean value among bacteria. Shading represents the standard deviation from the mean. TP, true positives; FP, false positives; FN, false negatives.
FIGURE 3Comparison of variant-calling performance between joint calls and single calls. Y-axis presents tools, including a combination of all tools (All). Tukey box plots are shown and colored according to the variant-calling mode. Single points represent samples >1.5 times the interquartile distance. Asterisks indicate statistically significant differences (p < 0.05) in a paired Wilcoxon test comparing both groups. (A) Sensitivity metrics in the uni-strain scenario. (B) Precision metrics in the uni-strain scenario. (C) Sensitivity metrics in the multi-strain scenario. (D) Precision metrics in the multi-strain scenario.
FIGURE 4Effect of average depth of genome coverage and species abundance on variant calling performance. Each dot represents a sample. (A) Effect of reference genome coverage on precision and sensitivity of each tool. (B) Effect of species abundance on precision and sensitivity of each tool. Trend lines were fitted with local polynomial regression (LOESS). Shading represents the 95% confidence interval of the trend line.
FIGURE 5Variant calling on real data. (A) Site frequency spectrum of variants in the whole metagenome called by HaplotypeCaller and MetaSNV. Total site frequency spectrum is the sum of the variants in the 10 chosen bacterial taxa. (B) Genetic distance between samples. Manhattan distances were calculated from the SNV profile of each sample. Tukey boxplot shows the distribution of distances between samples belonging to the same individual at two timepoints (intra-individual) and between samples from different people (inter-individual). These distances were used to cluster the data. (C) Tukey boxplot of the distribution of Bray-Curtis dissimilarities computed from the estimation of taxonomic abundance between samples from the same individual and from different people.
Single nucleotide variation (SNV) profiles in the Human Microbiome Project (HMP) data.
| Taxa_profiled | Tool | Number of variants | % Clustered | Sensitivity | Precision |
| M | 132,878 | 20.9 | 0.955 | 0.992 | |
| H | 115,968 | 18.6 | 0.952 | 0.996 | |
| M | 180,781 | 74.4 | 0.984 | 0.852 | |
| H | 98,442 | 69.8 | 0.951 | 0.976 | |
| M | 264,480 | 93 | 0.968 | 0.290 | |
| H | 225,677 | 93 | 0.453 | 0.263 | |
| M | 272,535 | 97.7 | 0.994 | 0.705 | |
| H | 218,598 | 93 | 0.957 | 0.986 | |
| M | 120,086 | 16.3 | 0.974 | 0.682 | |
| H | 77,326 | 7 | 0.879 | 0.881 | |
| M | 75,369 | 4.7 | 0.989 | 0.802 | |
| H | 48,786 | 4.7 | 0.963 | 0.964 | |
| M | 234,000 | 79.1 | 0.953 | 0.830 | |
| H | 189,376 | 65.1 | 0.949 | 0.993 | |
| M | 324,595 | 46.5 | 0.972 | 0.931 | |
| H | 217,101 | 37.2 | 0.970 | 0.995 | |
| M | 86,590 | 23.3 | 0.983 | 0.738 | |
| H | 52,391 | 16.3 | 0.922 | 0.910 | |
| M | 182,558 | 34.9 | 0.964 | 0.799 | |
| H | 107,252 | 14 | 0.938 | 0.968 | |
| Total SNV profile | M | 1,873,872 | 93 | NA | NA |
| Total SNV profile | H | 1,350,917 | 93 | NA | NA |