Dieter M Tourlousse1, Koji Narita2,3, Takamasa Miura4, Mitsuo Sakamoto5, Akiko Ohashi1, Keita Shiina1, Masami Matsuda1, Daisuke Miura1, Mamiko Shimamura4, Yoshifumi Ohyama4, Atsushi Yamazoe4, Yoshihito Uchino4, Keishi Kameyama2,6, Shingo Arioka2,7, Jiro Kataoka2,8, Takayoshi Hisada2,9, Kazuyuki Fujii2,10, Shunsuke Takahashi2,9, Miho Kuroiwa2,7, Masatomo Rokushima2,7, Mitsue Nishiyama2,11, Yoshiki Tanaka2,12, Takuya Fuchikami2,13, Hitomi Aoki2,13, Satoshi Kira2,13, Ryo Koyanagi2,14, Takeshi Naito2,15, Morie Nishiwaki2,15, Hirotaka Kumagai2,16, Mikiko Konda2,16, Ken Kasahara2,3, Moriya Ohkuma5, Hiroko Kawasaki4, Yuji Sekiguchi17, Jun Terauchi18,19.
Abstract
BACKGROUND: Validation and standardization of methodologies for microbial community measurements by high-throughput sequencing are needed to support human microbiome research and its industrialization. This study set out to establish standards-based solutions to improve the accuracy and reproducibility of metagenomics-based microbiome profiling of human fecal samples.
Keywords: Accuracy, reproducibility, and comparability; DNA extraction; Gut microbiota; Human microbiome; Industrialization; Library construction; Metagenomics; Standardization
Year: 2021 PMID: 33910647 PMCID: PMC8082873 DOI: 10.1186/s40168-021-01048-3
Source DB: PubMed Journal: Microbiome ISSN: 2049-2618 Impact factor: 14.650
Fig. 1 Comparison of protocols for sequencing library construction. a Compositional PCA ordination plot of measured DNA mock community compositions, based on clr (centered log ratio) transformed abundances. The red bold letter T depicts the expected composition (“ground truth”) projected onto the PCA ordination and symbols show individual replicates. Values in the axis labels represent the percentage of variance explained. Protocol identifiers were overlaid with jitter to prevent overlapping labels. b Dependence of the metric variance of measured compositions on DNA input amount and corresponding PCR conditions for library amplification (X0, XL, and XH). c Relationship between differences in genomic GC content of pairs of genomes/strains and their contribution to the metric variance shown in panel b. d Protocol-dependent variation in quantification bias due to genomic GC content. The GC bias metric represents the slope of the intercept-free linear regression line of log2-transformed abundance ratios for all possible pairs of strains to their differences in genomic GC content (see Fig. S3). e Variation in proportion of PCR duplicates. Protocols are ordered along the y-axis as in panel d and both panels share a common y-axis. f, g Closeness of agreement between the ground truth and measured compositions, expressed in terms of Aitchison distances (f) and absolute fold-differences (g). Kits are ranked along the y-axis based on Aitchison distances, averaged across DNA input amounts for each of the kits. For panel g, colored symbols show the geometric mean of strain-wise absolute fold-differences to the ground truth (that is, gmAFD) and black circles represent fold-differences for individual strains. h Heatmap of pairwise Aitchison distances showing quantitative consistency of measured compositions among protocols.
i Variation in fragmentation bias, expressed as Aitchison distances between observed and expected base frequencies averaged across the first fifteen cycles of the forward read (see Fig. S4). j Variation in N50 values of the DNA mock community metagenome assemblies. For panels g–j, protocols are sorted as in panel f. For panels b, c, and f–h, values were computed based on the center (compositional mean) of three technical replicates. For panels d, e, i, and j, results are shown as the mean (symbols) and standard deviation (error bars), if visible, of three technical replicates. Across all panels, common symbol fill colors and shapes reflect kits and DNA input amounts, respectively, as shown in the legend of panel a
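The agreement metrics named in this caption (Aitchison distance, gmAFD, and the GC bias slope) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes strictly positive abundances (no zeros), takes the absolute fold-difference of a strain to be the larger of measured/expected and expected/measured, and assumes the pairwise ratios for the GC bias metric are formed from abundances already expressed relative to the ground truth. All function and parameter names are hypothetical.

```python
import math
from itertools import combinations


def clr(composition):
    """Centered log-ratio transform of a strictly positive composition."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(measured, expected):
    """Euclidean distance between the clr-transformed compositions."""
    a, b = clr(measured), clr(expected)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def gm_afd(measured, expected):
    """Geometric mean of strain-wise absolute fold-differences.

    Assumption: the absolute fold-difference is 2**|log2(measured/expected)|.
    """
    abs_log2 = [abs(math.log2(m / e)) for m, e in zip(measured, expected)]
    return 2 ** (sum(abs_log2) / len(abs_log2))


def gc_bias_slope(rel_abundance, gc_content):
    """Slope of an intercept-free least-squares line through all strain pairs.

    x = difference in genomic GC content of a pair,
    y = log2 ratio of the pair's abundances (assumed relative to ground truth).
    """
    num = den = 0.0
    for i, j in combinations(range(len(rel_abundance)), 2):
        dx = gc_content[i] - gc_content[j]
        dy = math.log2(rel_abundance[i] / rel_abundance[j])
        num += dx * dy
        den += dx * dx
    return num / den
```

Because the clr transform removes any common scaling factor, `aitchison_distance([1, 2, 3], [2, 4, 6])` is zero up to floating-point error, which is why read counts and proportions give the same distance.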
Fig. 2 Comparison of protocols for DNA extraction. a Compositional PCA ordination plot of measured cell mock community compositions, based on clr (centered log ratio) transformed abundances. The red bold letter T shows the expected composition (“ground truth”) projected onto the PCA ordination and symbols represent individual replicates. Values in the axis labels represent the percentage of variance explained. For protocols N, L, and S, arrows show approximate trajectories of measurements for DNA extractions performed with increasing total bead-beating time. b Relationship between the Gram-type cell walls of pairs of strains and their contribution to the metric variance across protocols. c Cumulative relative abundance of Gram-positives (denoted as G+) as a function of bead-beating regime. The dashed horizontal line represents the expected proportion. d Measured abundances of different strains, relative to E. coli strain NBRC 3301, as a function of bead-beating regime for protocol N. Colors represent different Gram-positives and results for all Gram-negatives are shown as dotted gray lines. e Ranking of protocols based on the closeness of agreement between the ground truth and measured compositions, expressed in terms of Aitchison distances. f Effect of total bead-beating time on agreement between measured compositions and the ground truth, expressed in terms of Aitchison distances (left panel) and gmAFDs (right panel). Horizontal dashed lines represent corresponding values for protocol Q. g Scatter plots showing quantitative agreement between community profiles measured with protocol Q (x-axis) and protocols L, N, and S (y-axis) for the cell mock community (upper plots) and fecal sample S01 (lower plots). For the fecal sample, relative abundances were calculated as the percentage of reads assigned to a given species by kraken2. Gray areas represent up to 1.5- or 2-fold differences for the upper and lower plots, respectively.
Data represent the mean and standard deviation of two or three technical replicates and corresponding gmAFDs calculated based on the means are indicated in the facet labels. For panels c and d, results are shown as the mean (symbols or lines) and the standard deviation (error bars or ribbons, if visible) of two or three technical replicates. For panels e and f, values were computed based on the center (compositional mean) of two or three technical replicates. Across all panels, common symbol and line colors reflect DNA extraction protocols/kits, as shown in the legend of panel a
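The "center (compositional mean)" of technical replicates and the cumulative Gram-positive proportion mentioned in this caption can be sketched as follows. This is an illustrative sketch, not the authors' code. It assumes the usual compositional-data definition of the center, namely the component-wise geometric mean rescaled to sum to one, and strictly positive abundances. Function names and the boolean Gram-type flags are hypothetical.

```python
import math


def closure(values):
    """Rescale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def compositional_center(replicates):
    """Closed component-wise geometric mean of replicate compositions."""
    n = len(replicates)
    geo_means = [
        math.exp(sum(math.log(rep[k]) for rep in replicates) / n)
        for k in range(len(replicates[0]))
    ]
    return closure(geo_means)


def gram_positive_fraction(composition, is_gram_positive):
    """Cumulative relative abundance of the strains flagged as Gram-positive."""
    closed = closure(composition)
    return sum(p for p, flag in zip(closed, is_gram_positive) if flag)
```

Unlike the arithmetic mean, the geometric-mean center is consistent with log-ratio metrics such as the Aitchison distance: it is the point whose clr coordinates are the average of the replicates' clr coordinates.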
Fig. 3 Evaluation of intermediate precision and interlaboratory reproducibility of SOPs for sequencing library construction and DNA extraction. a, b Distribution of pairwise Aitchison distances of replicated measurements associated with different operator+lot combinations and laboratories, for library construction (a) and DNA extraction (b), as evaluated using the DNA and cell mock community, respectively. Violin plots depict the distribution of all pairwise distances and symbols show individual datapoints. If applicable, protocol identifiers are indicated. The subpanel in b shows distances between three custom DNA extraction protocols (shown as different shapes) and protocol N. c Bar charts of estimated qmCVs attributed to different components of variance. Error bars for the intermediate precision and interlaboratory reproducibility estimates represent one-sided 95% confidence intervals. d Similar to panel b, for fecal samples. The first two subpanels show results for fecal sample S01. The third subpanel shows distances between different samples (that is, feces from different donors, denoted as S01 to S13) based on measurements performed by the central laboratory, with DNA extraction and library construction performed using protocols N and BL, respectively. The fourth subpanel shows distances between replicated measurements performed by four laboratories for samples S01 to S13, with all steps (that is, DNA extraction, library construction and sequencing) performed by the participating laboratories. Data from laboratories for which at least one of the duplicate measurements was considered an outlier are shown as red symbols in the first and fourth subpanels (see Fig. S17). e Relationship between species-wise coefficients of variation (CVs) of measured relative abundances and mean relative abundances across replicated measurements to calculate the LOQ (fitted CV of 40%) under different levels of replication as indicated by the colors.
Note that CV values exceeding 100% were set to 100% for visualization purposes only. The gray line represents the fit of a species’ probability of detection (POD) to its mean abundance to estimate the LOD (see Fig. S18). f Bar charts of estimated qmCVs attributed to different components of variance for fecal sample S01. Error bars for the precision and reproducibility estimates are one-sided 95% confidence intervals. For panels c and f, fill colors show the metric variance component based on which the corresponding qmCV values were calculated (see Supplementary Methods)
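The species-wise CV shown in panel e, including the cap at 100% that the caption says is applied for visualization only, can be sketched as below. This is a minimal illustration rather than the authors' procedure: it assumes the sample standard deviation (n − 1 denominator) and a non-zero mean, and it does not attempt the curve fit used to derive the LOQ or LOD. Function names are hypothetical.

```python
import statistics


def coefficient_of_variation(abundances):
    """CV in percent: sample standard deviation divided by the mean.

    Assumes at least two replicates and a non-zero mean abundance.
    """
    mean = statistics.mean(abundances)
    return 100.0 * statistics.stdev(abundances) / mean


def cv_for_plot(abundances, cap=100.0):
    """CV clipped at `cap` percent, for display purposes only."""
    return min(coefficient_of_variation(abundances), cap)
```

Keeping the capped value separate from the raw CV matters if the values are reused for fitting, since the cap is described as a plotting convenience and not as part of the LOQ estimate.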
Fig. 4 Assessment of the SOPs with the MOSAIC Standards Challenge samples. a Abundance of five major phyla across the five fecal samples (designated as 1 to 5 following the MOSAIC Standards Challenge naming), expressed as the proportion of reads assigned to each phylum. Violin plots show the distribution of publicly available data in the MOSAIC Standards Challenge database, and gray symbols show data for protocols N (squares), P (circles) and Q (diamonds), with individual measurement results shown for each protocol. b Proportion of reads assigned to the species M. smithii for fecal sample 2. Datasets are ranked by decreasing abundance, white circles are for public data, and gray symbols show individual results for protocols N, P and Q as in panel a. c Shannon diversity across fecal samples. d Distribution of Aitchison distances to protocol Q. The short black horizontal lines show the distance between two technical replicates for protocol Q. For panels c and d, gray symbols show individual results for protocols N, P and Q as in panel a, and boxplots show the distribution of public data in the MOSAIC Standards Challenge database. For the boxplots, the thick horizontal line represents the median, hinges show the 25th and 75th percentiles, whiskers extend to the largest and smallest value at most 1.5× the IQR (interquartile range) from the upper and lower hinges, respectively, and outlying datapoints beyond the end of the whiskers are shown as black circles. For all panels, larger symbols represent data deposited to the MOSAIC Standards Challenge website and smaller symbols are for additional replicates available in the SRA (see Table S8). Symbol shapes are common for all panels
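Two quantities in this caption lend themselves to a short worked sketch: the Shannon diversity of panel c and the whisker rule described for the boxplots. The code below is illustrative only. It assumes the natural logarithm for the Shannon index (the caption does not state the base) and uses the "inclusive" quantile method of Python's `statistics` module for the hinges, which may differ slightly from the plotting software the authors used. Function names are hypothetical.

```python
import math
import statistics


def shannon_diversity(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in props)


def whisker_limits(values):
    """Ends of the boxplot whiskers under the 1.5 x IQR rule.

    Returns the smallest and largest datapoints that lie within
    1.5 x IQR of the lower and upper hinges; anything outside is an outlier.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return min(inside), max(inside)
```

For example, with the values 1, 2, 3, 4, 5 and 100 the whiskers end at 1 and 5, and 100 would be drawn as an outlying point.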