Simon A Hardwick, Wendy Y Chen, Ted Wong, Bindu S Kanakamedala, Ira W Deveson, Sarah E Ongley, Nadia S Santini, Esteban Marcellin, Martin A Smith, Lars K Nielsen, Catherine E Lovelock, Brett A Neilan, Tim R Mercer.
Abstract
The complexity of microbial communities, combined with technical biases in next-generation sequencing, poses a challenge to metagenomic analysis. Here, we develop a set of internal DNA standards, termed "sequins" (sequencing spike-ins), that together constitute a synthetic community of artificial microbial genomes. Sequins are added to environmental DNA samples prior to library preparation, and undergo concurrent sequencing with the accompanying sample. We validate the performance of sequins by comparison to mock microbial communities, and demonstrate their use in the analysis of real metagenome samples. We show how sequins can be used to measure fold change differences in the size and structure of accompanying microbial communities, and perform quantitative normalization between samples. We further illustrate how sequins can be used to benchmark and optimize new methods, including nanopore long-read sequencing technology. We provide metagenome sequins, along with associated data sets, protocols, and an accompanying software toolkit, as reference standards to aid in metagenomic studies.
Year: 2018 PMID: 30082706 PMCID: PMC6078961 DOI: 10.1038/s41467-018-05555-0
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1. Schematic showing the design, use, and validation of DNA-sequencing spike-ins (sequins) for metagenomic analysis. (a) Metagenome sequins are designed by inverting a selected subsequence of a microbial genome that is then synthesized and combined with other DNA standards into a staggered mixture that represents a natural microbial community. Sequins are spiked into a user’s DNA sample (at a low fractional concentration, e.g., 2%) and undergo combined library preparation, sequencing, and analysis. Examples of the bioinformatic steps that can be assessed using sequins are indicated. To validate the design of sequins, we generated simulated libraries that were not confounded by additional technical variables (b, c) and spiked sequins into a mock microbial community (MBARC-26) and sequenced the combined sample (d, e). Sequins (blue) displayed comparable breadth of alignment coverage to real microbial genomes (red) across the full range of fold-coverage depths observed in both simulated (b) and experimental (d) data. We also observed equivalent de novo assembly of sequins and MBARC-26 genomes at matched fold-coverage levels in both simulated (c) and experimental (e) data. The dashed lines in d and e were fitted using Richard’s five-parameter dose–response curve.
Fig. 2. Using sequins to assess quantitative accuracy and de novo assembly. (a) Scatter plot shows measured fold-coverage for each sequin plotted against its known input concentration, in triplicate. Error bars indicate standard deviation (SD) between replicates. A linear regression model is fitted to the data (dotted line) with 95% confidence interval shown (gray shading). (b) By sequencing a neat preparation of the staggered sequins mixture (Mix A) in triplicate, we observed the same coverage patterns and biases in each replicate. Example shows genome browser views of the sequencing coverage for three replicates of a randomly chosen sequin, MG_23. (c) Scatter plot shows the fraction of each sequin that was de novo-assembled, plotted against its known input concentration, in triplicate. Error bars indicate SD between replicates. (d) Genome browser views of two sequins demonstrate the impact of sequence coverage on assembly. MG_46 (top) shows high sequence coverage, and was fully assembled in a single contig. By contrast, MG_56 (bottom) has low sequence coverage and was only partially assembled, in four fragmented contigs.
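The regression in Fig. 2a can be illustrated with a minimal sketch. It assumes the fit is an ordinary least-squares line in log2 space; the caption does not state the scale or the software used, and the function name `fit_log2_line` and the example numbers are hypothetical.

```python
import math

def fit_log2_line(concentrations, coverages):
    """Least-squares fit of log2(coverage) on log2(concentration).

    Assumption: the fit is performed in log2 space. Returns the slope,
    the intercept, and the R^2 of the fitted line.
    """
    xs = [math.log2(c) for c in concentrations]
    ys = [math.log2(v) for v in coverages]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical sequins whose coverage is exactly proportional to input
slope, intercept, r_squared = fit_log2_line([1, 2, 4, 8], [3, 6, 12, 24])
```

A slope close to 1 in this sketch would indicate that measured coverage scales proportionally with input concentration across the staggered mixture.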
Fig. 3. Using sequins to assess fold changes and normalize between real metagenome samples of unknown content. (a) Scatter plot shows observed log2 fold change (LFC) against expected LFC for each sequin across Mixes A and B (nine replicates of each). Error bars indicate standard deviation (SD) among replicates. The subset of sequins that remain fixed between mixtures (red) are used in RUVg normalization. (b) Box plots show the coefficient of variation (SD/mean) between replicates for each phylum within each sample type, both before (blue) and after (red) RUVg normalization. Box center lines indicate median; bounds of boxes indicate upper and lower quartiles; whiskers extend to min/max values. (c) Without normalization, samples clustered loosely by sample type rather than environment, as shown by principal component analysis (PCA; left). However, relative log expression (RLE) box plots demonstrated a clear need for normalization, with samples not centered around zero and wide differences in variation between samples (right). (d) After performing RUVg normalization using sequins, samples still clustered by sample type (left), and RLE plots showed a marked improvement with samples now centered around zero and the excessive variation of most samples removed (right). Quantification results are based on phylum-level abundance estimates from read mapping (using MG-RAST; n = 75 phyla).
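The two diagnostics named in Fig. 3 can be sketched as follows. This is an illustration only, not the RUVg procedure itself: it assumes the sample standard deviation for the coefficient of variation, and the commonly used definition of RLE as each log abundance minus the per-feature median across samples. The function names and inputs are hypothetical.

```python
import math
import statistics

def coefficient_of_variation(values):
    """SD / mean across replicates (assumption: sample SD, n - 1)."""
    return statistics.stdev(values) / statistics.mean(values)

def relative_log_expression(abundances):
    """Per-feature log2 deviations from the across-sample median.

    abundances: dict mapping a feature (e.g. a phylum) to a list of
    strictly positive abundance estimates, one per sample.
    """
    rle = {}
    for feature, values in abundances.items():
        logs = [math.log2(v) for v in values]
        median_log = statistics.median(logs)
        rle[feature] = [x - median_log for x in logs]
    return rle

# Hypothetical replicate abundances for one phylum
cv = coefficient_of_variation([2.0, 4.0, 6.0])
rle = relative_log_expression({"phylum_1": [1.0, 2.0, 4.0]})
```

In an RLE box plot, the values for each sample are pooled across features; well-normalized samples are expected to be centered near zero with similar spread.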
Fig. 4. Using sequins to accurately resolve both relative and absolute changes in microbial load. (a) We assembled six mock microbial mixtures comprising different amounts of four distinct cyanobacteria species. A fixed (as opposed to fractional) amount of sequins (red) was added to each mixture. (b) Normalizing on the basis of library depth and genome size enabled the fractional composition of each mixture to be accurately measured, but left Mixes A1, A2, and A3 indistinguishable from each other (likewise for Mixes B1, B2, and B3). As the amount of microbial DNA increased, however, the fraction of reads aligning to sequins decreased concomitantly. (c) Dividing the fractional abundance of each species by the fraction of reads aligned to sequins in each mixture rescaled each sample’s reference point, thereby allowing the accurate resolution of both relative and absolute abundance. (d) We tracked the fold changes observed for the Synechocystis sp. PCC 6803 genome across the six mixtures, providing 15 separate fold change comparisons, including both relative and absolute abundance shifts. Bar graph shows the log2 fold changes (LFC) observed after performing normalization on the basis of genome size and library depth (blue), and after carrying out additional normalization with sequins (red). The true LFC value is shown in gray. (e) Scatter plot shows the measured LFC for each species plotted against the expected LFC (a total of 60 comparisons). R² value indicates the correlation coefficient obtained after carrying out linear regression analysis.
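The rescaling described in Fig. 4c can be sketched with a small worked example. It assumes that fractional abundance is computed as reads per base pair of genome divided by total library depth (sequin reads included), so that dividing by the sequin read fraction cancels the library depth. All function names, read counts, and the genome size below are hypothetical.

```python
import math

def depth_normalized(reads, genome_size_bp, total_reads):
    """Abundance normalized by genome size and library depth."""
    return reads / genome_size_bp / total_reads

def sequin_scaled(reads, genome_size_bp, total_reads, sequin_reads):
    """Depth-normalized abundance divided by the sequin read fraction."""
    sequin_fraction = sequin_reads / total_reads
    return depth_normalized(reads, genome_size_bp, total_reads) / sequin_fraction

def log2_fold_change(before, after):
    return math.log2(after / before)

GENOME_BP = 4_000_000  # hypothetical genome size

# Two hypothetical mixtures with the same fixed sequin input; the second
# contains twice as much microbial DNA, so its sequin fraction is lower.
mix_1 = dict(reads=400, total_reads=1000, sequin_reads=200)
mix_2 = dict(reads=800, total_reads=1800, sequin_reads=200)

lfc_depth_only = log2_fold_change(
    depth_normalized(mix_1["reads"], GENOME_BP, mix_1["total_reads"]),
    depth_normalized(mix_2["reads"], GENOME_BP, mix_2["total_reads"]),
)
lfc_with_sequins = log2_fold_change(
    sequin_scaled(GENOME_BP=None or GENOME_BP, **{}) if False else
    sequin_scaled(mix_1["reads"], GENOME_BP, mix_1["total_reads"], mix_1["sequin_reads"]),
    sequin_scaled(mix_2["reads"], GENOME_BP, mix_2["total_reads"], mix_2["sequin_reads"]),
)
```

Under these assumptions, depth-only normalization understates the twofold increase in microbial load, whereas the sequin-scaled values recover a log2 fold change of 1.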