Dieter M Tourlousse, Koji Narita, Takamasa Miura, Akiko Ohashi, Masami Matsuda, Yoshifumi Ohyama, Mamiko Shimamura, Masataka Furukawa, Ken Kasahara, Keishi Kameyama, Sakae Saito, Maki Goto, Ritsuko Shimizu, Riko Mishima, Jiro Nakayama, Koji Hosomi, Jun Kunisawa, Jun Terauchi, Yuji Sekiguchi, Hiroko Kawasaki.
Abstract
Standardization and quality assurance of microbiome community analysis by high-throughput DNA sequencing require widely accessible and well-characterized reference materials. Here, we report on newly developed DNA and whole-cell mock communities to serve as control reagents for human gut microbiota measurements by shotgun metagenomics and 16S rRNA gene amplicon sequencing. The mock communities were formulated as near-even blends of up to 20 bacterial species prevalent in the human gut, span a wide range of genomic guanine-cytosine (GC) contents, and include multiple strains with Gram-positive type cell walls. Through a collaborative study, we carefully characterized the mock communities by shotgun metagenomics, using previously developed standardized protocols for DNA extraction and sequencing library construction. Further, we validated the fitness of the mock communities for revealing technically meaningful differences among protocols for DNA extraction and metagenome/16S rRNA gene amplicon library construction. Finally, we used the mock communities to reveal varying performance of metagenome-based taxonomic profilers and the impact of trimming and filtering of sequencing reads on observed species profiles. The latter showed that aggressive preprocessing of reads may result in substantial GC-dependent bias and should thus be carefully evaluated to minimize unintended effects on species abundances. Taken together, the mock communities are expected to support a myriad of applications that rely on well-characterized control reagents, ranging from evaluation and optimization of methods to assessment of reproducibility in interlaboratory studies and routine quality control.
IMPORTANCE Application of high-throughput DNA sequencing has greatly accelerated human microbiome research and its translation into new therapeutic and diagnostic capabilities. Results of microbiome community analyses can, however, vary considerably across studies or laboratories, and establishment of measurement standards to improve accuracy and reproducibility has become a priority. The mock communities developed here, which are available from the NITE Biological Resource Center (NBRC) at the National Institute of Technology and Evaluation (NITE, Japan), provide well-characterized control reagents that allow users to judge the accuracy of their measurement results. Widespread and consistent adoption of the mock communities will improve reproducibility and comparability of microbiome community analyses, thereby supporting and accelerating human microbiome research and development.
Keywords: control reagents; human microbiome; metagenomics; standards
Year: 2022 PMID: 35234490 PMCID: PMC8941912 DOI: 10.1128/spectrum.01915-21
Source DB: PubMed Journal: Microbiol Spectr ISSN: 2165-0497
Bacterial species included in the mock communities
| Species | Culture collection | Nucleotide accession | Genome size (bp) | GC content (%) | 16S rRNA genes | Cell wall (Gram type) | Relative abundance in DNA mock (%) | Relative abundance in cell mock (%) |
|---|---|---|---|---|---|---|---|---|
| | NBRC 113350 | | 4,989,532 | 46.2 | 4 | – | 4.7 | 5.6 |
| | NBRC 113351 | | 6,247,046 | 46.7 | 5 | + | 4.5 | 5.6 |
| | NBRC 113352 | | 5,687,315 | 48.9 | 5 | + | 5.3 | 5.6 |
| | NBRC 113806 | | 5,179,960 | 45.0 | 7 | – | 4.8 | 5.6 |
| | NBRC 13719 | | 4,295,305 | 43.3 | 10 | + | 5.2 | 5.6 |
| | NBRC 13955 | | 2,018,796 | 36.9 | 5 | + | 6.9 | 5.6 |
| | NBRC 14164 | | 6,156,701 | 62.3 | 7 | – | 3.9 | 5.6 |
| | NBRC 3202 | | 1,910,306 | 50.1 | 8 | + | 3.6 | 5.6 |
| | NBRC 3301 | | 4,755,096 | 50.8 | 7 | – | 5.6 | 5.6 |
| | NBRC 113805 | | 4,277,038 | 60.4 | 3 | + | 3.7 | 5.6 |
| | NBRC 113846 | | 2,520,735 | 32.2 | 6 | + | 4.8 | 5.6 |
| C. acnes | NBRC 113869 | | 2,560,907 | 60.0 | 3 | + | 5.0 | 5.6 |
| Bifidobacterium longum | NBRC 114370 | | 2,594,022 | 60.1 | 5 | + | 5.7 | 5.6 |
| | NBRC 114412 | | 3,284,789 | 44.5 | 4 | ± | 5.3 | 5.6 |
| | NBRC 114413 | | 3,757,469 | 42.5 | 5 | + | 5.6 | 5.6 |
| M. massiliensis | NBRC 114414 | | 2,610,024 | 50.6 | 7 | – | 4.8 | 0 |
| M. funiformis | NBRC 114415 | | 2,464,533 | 31.5 | 6 | – | 3.7 | 0 |
| | NBRC 114504 | | 2,278,612 | 60.3 | 5 | + | 6.2 | 5.6 |
| Bifidobacterium longum | NBRC 114494 | | 2,534,372 | 60.1 | 4 | + | 4.7 | 5.6 |
| | NBRC 114322 | | 2,788,458 | 55.7 | 3 | – | 6.0 | 5.6 |
The symbols +, – and ± indicate strains with Gram-positive, Gram-negative and Gram-variable type cell walls, respectively.
Relative abundances represent values assigned during formulation of the mock communities, based on quantification of the total DNA content of individual strains prior to mixing.
M. massiliensis and M. funiformis were excluded from the cell mock community.
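Because the assigned abundances are based on total DNA content, they are not the same as proportions of genome copies or of 16S rRNA gene copies. The following minimal sketch shows one way such a conversion could be approximated from the genome sizes and 16S rRNA gene counts in the table. It assumes that DNA mass is proportional to genome size times the number of genome copies, which ignores plasmids and other effects; the function names and the three-strain subset are illustrative only and are not taken from the study.

```python
# Illustrative subset: values copied from three rows of the table
# (DNA mock abundance in %, genome size in bp, 16S rRNA gene count).
strains = {
    "NBRC 113350": {"dna_pct": 4.7, "genome_bp": 4_989_532, "rrn": 4},
    "NBRC 13955": {"dna_pct": 6.9, "genome_bp": 2_018_796, "rrn": 5},
    "NBRC 14164": {"dna_pct": 3.9, "genome_bp": 6_156_701, "rrn": 7},
}


def normalize(weights):
    """Rescale non-negative weights so that they sum to 100 (%)."""
    total = sum(weights.values())
    return {name: 100.0 * value / total for name, value in weights.items()}


def expected_profiles(table):
    """Return (genome-copy %, 16S-copy %) under the stated assumption.

    Assumption (not from the source): DNA mass ~ genome size x genome copies,
    so genome copies ~ DNA fraction / genome size.
    """
    genome_copies = {
        name: row["dna_pct"] / row["genome_bp"] for name, row in table.items()
    }
    rrn_copies = {
        name: genome_copies[name] * row["rrn"] for name, row in table.items()
    }
    return normalize(genome_copies), normalize(rrn_copies)


genome_pct, rrn_pct = expected_profiles(strains)
for name in strains:
    print(f"{name}: genome copies {genome_pct[name]:.1f}%, "
          f"16S copies {rrn_pct[name]:.1f}%")
```

In this subset, the strain with the smallest genome (NBRC 13955) accounts for the largest share of genome copies, even though the DNA-based abundances of the three strains are of similar magnitude.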
FIG 1 (a) Bacterial species included in the mock communities. The phylogenetic tree was inferred based on single-copy marker genes, using GTDB-Tk. Phylum-level taxonomic assignments are indicated by colored circles shown at the leaves. Species features, that is, Gram type, genomic GC content and genome size, are shown. (b) Characterization of the mock communities by shotgun metagenomics and taxonomic profiling with MetaPhlAn3. Species profiles were generated based on the combined sequencing data of all replicated measurements, performed following SOPs, for both mock communities (n = 16 and n = 20 for the DNA and cell mock community, respectively, covering multiple aliquots, SOPs and laboratories; Table S3 in the supplemental material). The tree represents a taxonomic dendrogram. The heatmap depicts species profiles of 10 million in silico generated reads (5 million 151-bp paired-end reads) for each of the genomes shown on the x axis, with fill colors showing estimated relative abundances as indicated in the legend. (c) Relative abundance of each strain in the mock communities as assigned based on total DNA content, during formulation, and as measured by shotgun metagenomics. Empty circles show individual shotgun metagenomics measurement results as mentioned for panel b. Gray and blue/orange density plots show the distribution of simulated strain-wise abundances as indicated in the legend. ‘Acceptance’ ranges for strain-wise abundances are shown as error bars. Note that strains NBRC 114414 (M. massiliensis) and NBRC 114415 (M. funiformis) were not included in the cell mock community.
FIG 2 (a) Stacked bar charts of individual measurement results generated in the collaborative study for the DNA (top) and cell (bottom) mock community by amplicon and metagenome sequencing. Symbols below each bar indicate the type of protocol (SOP versus non-SOP for DNA extraction and library construction) and laboratory, as indicated in the legend. Note that the two strains of Bifidobacterium longum (NBRC 114370 and NBRC 114494) were analyzed as a single species. (b) Violin plots of pairwise Aitchison distances for measurements performed using SOPs and non-SOPs for library construction (top) and DNA extraction (bottom). Individual data points were overlaid with jitter. For the cell mock community, only protocols employing SOPs for library construction were included to highlight the effect of DNA extraction. (c) Bar charts of the relative abundance (without 16S rRNA gene copy number correction) of C. acnes strain NBRC 113869 for amplicon libraries prepared following the SOP using KAPA HiFi DNA polymerase and non-SOPs using alternative polymerases as indicated. Red and orange bars represent amplicons perfectly matching the expected primer sequences (Hamming distance, d, of zero), with and without editing, respectively. Blue bars indicate amplicons with at least a single mismatch, including undetermined (N) bases, to the expected edited or unedited primer sequences. The location of the primer mismatches to the template sequences is shown below the plot. (d) Cumulative relative abundance of Gram-positives measured for the cell mock community by amplicon and metagenome sequencing. Bars are colored by laboratory and symbols below the bars show the type of protocol, as in panel a. Bar heights represent the mean and error bars show standard deviations. (e) Violin plots of the distribution of pairwise Aitchison distances for the subcomposition of Gram-positives and Gram-negatives, considering SOPs and non-SOPs for DNA extraction and SOPs only for library construction.
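The Aitchison distance used in these comparisons is the Euclidean distance between two compositions after a centered log-ratio (clr) transform. The sketch below is a generic illustration of that definition, not the authors' analysis code. The source does not state how zero abundances were handled, so the `pseudocount` parameter is a hypothetical addition for that case.

```python
import math


def clr(composition, pseudocount=0.0):
    """Centered log-ratio transform of a composition (list of abundances).

    pseudocount is a hypothetical zero-handling choice, not from the source.
    """
    values = [x + pseudocount for x in composition]
    if any(v <= 0 for v in values):
        raise ValueError("all abundances must be positive after the pseudocount")
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    return [lv - mean_log for lv in logs]


def aitchison_distance(a, b, pseudocount=0.0):
    """Euclidean distance between the clr-transformed compositions a and b."""
    if len(a) != len(b):
        raise ValueError("compositions must list the same species in the same order")
    clr_a, clr_b = clr(a, pseudocount), clr(b, pseudocount)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(clr_a, clr_b)))


# Two made-up species profiles (percent), listed in the same species order.
profile_1 = [5.6, 4.7, 3.9, 6.9]
profile_2 = [5.0, 5.5, 2.8, 7.4]
print(aitchison_distance(profile_1, profile_2))
```

Because the clr transform works on log-ratios, the distance is unchanged when a profile is rescaled (for example, from percentages to fractions), which makes it suitable for comparing relative abundance data.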
FIG 3 (a) Number of species classified by each of the profilers plotted as a function of minimum abundance threshold. Data are shown as the mean (colored solid lines) and standard deviation (ribbons, if visible) of 16 replicated measurements of the DNA mock community, following SOPs for library construction. The dashed horizontal line indicates the expected number of species. (b) Estimated abundances of expected species in the DNA mock community. Symbols represent results of individual measurements and are colored according to the taxonomic profiler as in panel a. The graph on the right shows the abundance of false positives (that is, species that are not included in the mock community but classified as present), with the abundant species colored as indicated in the legend. (c) Violin plots showing the distribution of Aitchison distances, calculated based on the abundances of the expected species as shown in panel b, for each of the profilers. The top and bottom plots show all pairwise dissimilarities between technical replicates within a single laboratory (denoted as repeatability) and replicates from different laboratories (denoted as reproducibility), respectively. Larger pairwise distances indicate higher variability across replicated measurements. (d) Impact of read trimming/filtering on read retention and species abundances. The left three panels show the percentage of raw reads retained following read processing by fastp for three representative data sets with varying base quality (blue: Q30 bases of 90.1%; orange: 78.2%; red: 71.7%). Applied settings for read processing by fastp are indicated below (also see Table S5 in the supplemental material) and sorted according to the percentage of retained reads for the data set with lowest raw base quality. The adjacent plots show the relative abundance of each of the strains, as determined by kallisto, for the same three data sets. Facets are sorted according to increasing genomic GC content, indicated in each of the plots. Filled and empty circles indicate the two settings evaluated for panels d and e. (e) Aitchison distances between species profiles of read data with quality trimming (settings 15_100_65 in Table S5 in the supplemental material) and without quality trimming (settings 0_100_65). Data points, each representing a different library, are plotted as a function of the libraries’ raw base quality (x axis); colors and shapes reflect the mock community and profiling tool, respectively. Solid lines represent loess fits, for visualization purposes only. (f) Violin plots of pairwise Aitchison distances demonstrating the impact of quality trimming (settings 15_100_65, with QT and 0_100_65, without QT) on perceived reproducibility. Only pairwise comparisons for measurements performed by different laboratories following the SOPs were included.
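The analysis in panel a, counting the species a profiler reports at or above a minimum abundance threshold, can be illustrated with a short sketch. The profile values, species labels and function name below are made up for illustration and do not come from the study; the sketch only shows how low-abundance false positives drop out as the threshold increases.

```python
def species_above_threshold(profile, thresholds):
    """Count species whose relative abundance (%) is >= each threshold."""
    return {
        t: sum(1 for abundance in profile.values() if abundance >= t)
        for t in thresholds
    }


# Hypothetical profiler output (percent); labels are placeholders.
profile = {
    "expected_species_a": 5.2,
    "expected_species_b": 4.8,
    "false_positive_x": 0.02,
}

counts = species_above_threshold(profile, [0.001, 0.01, 0.1, 1.0])
for threshold, n_species in counts.items():
    print(f"min abundance {threshold}%: {n_species} species")
```

With these made-up values, the false positive is counted at thresholds of 0.001% and 0.01% but is excluded at 0.1% and above, leaving only the two expected species.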