Adam G Clooney, Fiona Fouhy, Roy D Sleator, Aisling O'Driscoll, Catherine Stanton, Paul D Cotter, Marcus J Claesson.
Abstract
Rapid advancements in sequencing technologies along with falling costs present widespread opportunities for microbiome studies across a vast and diverse array of environments. These impressive technological developments have been accompanied by a considerable growth in the number of methodological variables, including sampling, storage, DNA extraction, primer pairs, sequencing technology, chemistry version, read length, insert size, and analysis pipelines, amongst others. This increase in variability threatens to compromise both the reproducibility and the comparability of studies conducted. Here we perform the first reported study comparing both amplicon and shotgun sequencing for the three leading next-generation sequencing technologies. These were applied to six human stool samples using Illumina HiSeq, MiSeq and Ion PGM shotgun sequencing, as well as amplicon sequencing across two variable 16S rRNA gene regions. Notably, we found that the factor responsible for the greatest variance in microbiota composition was the chosen methodology rather than the natural inter-individual variance, which is commonly one of the most significant drivers in microbiome studies. Amplicon sequencing suffered from this to a large extent, and this issue was particularly apparent when the 16S rRNA V1-V2 region amplicons were sequenced with MiSeq. Somewhat surprisingly, the choice of taxonomic binning software for shotgun sequences proved to be of crucial importance, with even greater discriminatory power than sequencing technology and choice of amplicon. Optimal N50 assembly values for the HiSeq were obtained for 10 million reads per sample, whereas the applied MiSeq and PGM sequencing depths proved less sufficient for shotgun sequencing of stool samples. The latter technologies, on the other hand, provide a better basis for functional gene categorisation, possibly due to their longer read lengths.
Hence, in addition to highlighting methodological biases, this study demonstrates the risks associated with comparing data generated using different strategies. We also recommend that laboratories with particular interests in certain microbes should optimise their protocols to accurately detect these taxa using different techniques.
Year: 2016 PMID: 26849217 PMCID: PMC4746063 DOI: 10.1371/journal.pone.0148028
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Heat-plot representing the taxonomic composition of the samples at genus level.
The heat-plot includes amplicon data alongside shotgun datasets from three classifiers, namely MetaPhlAn2, Kraken and GOTTCHA. Only genera present in a minimum of 20% of datasets were retained. The correlation method was Spearman, with Ward D2 clustering (PGM = Ion Personal Genome Machine).
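The 20% retention rule in the Fig 1 caption can be illustrated with a minimal sketch. It assumes that a genus counts as present in a dataset when its abundance is greater than zero, which the caption does not state; the function name and the toy abundance values are hypothetical.

```python
def filter_by_prevalence(abundance, min_fraction=0.2):
    """Keep genera detected in at least min_fraction of datasets.

    abundance: dict mapping genus name -> list of abundances, one per dataset.
    Assumption: "detected" means abundance > 0.
    """
    kept = {}
    for genus, values in abundance.items():
        n_present = sum(1 for v in values if v > 0)
        if n_present / len(values) >= min_fraction:
            kept[genus] = values
    return kept


# Hypothetical toy table: three genera across five datasets.
toy = {
    "Bacteroides": [0.40, 0.35, 0.50, 0.45, 0.30],  # present in 5/5
    "Prevotella": [0.00, 0.10, 0.00, 0.00, 0.00],   # present in 1/5 = 20%
    "RareGenus": [0.00, 0.00, 0.00, 0.00, 0.00],    # never detected
}
filtered = filter_by_prevalence(toy)
```

In this toy table a genus seen in exactly one of five datasets sits on the 20% boundary and is retained, while a genus that is never detected is dropped.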
Fig 2. Bar-charts of taxonomic composition at family level.
The families are first organised by phylum abundance (highest to lowest) followed by family abundance (highest to lowest) in each of the phyla. The numbers of observed species are located at the top of each bar.
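The two-level ordering described for Fig 2 could be sketched as follows. This is only an illustration: it assumes that "phylum abundance" means the summed abundance of a phylum's families, and the abundance values are invented for the example.

```python
from collections import defaultdict


def order_families(rows):
    """Order families by phylum total (descending), then family abundance (descending).

    rows: list of (phylum, family, abundance) tuples.
    Assumption: phylum abundance is the sum of its families' abundances.
    """
    phylum_total = defaultdict(float)
    for phylum, _family, abundance in rows:
        phylum_total[phylum] += abundance
    # Phylum and family names act as tie-breakers so the order is deterministic.
    ordered = sorted(rows, key=lambda r: (-phylum_total[r[0]], r[0], -r[2], r[1]))
    return [family for _phylum, family, _abundance in ordered]


# Hypothetical relative abundances (percent) for one sample.
toy_rows = [
    ("Firmicutes", "Lachnospiraceae", 30.0),
    ("Bacteroidetes", "Bacteroidaceae", 25.0),
    ("Firmicutes", "Ruminococcaceae", 20.0),
    ("Bacteroidetes", "Prevotellaceae", 5.0),
    ("Actinobacteria", "Bifidobacteriaceae", 10.0),
]
family_order = order_families(toy_rows)
```

Here Firmicutes (total 50) comes first, so both of its families precede Bacteroidaceae even though Ruminococcaceae is individually less abundant than Bacteroidaceae.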
Fig 3. Observed species at various sequencing depths for the amplicon data using SPINGO.
The data points represent the median values across the 6 samples and the error bars show the 25th and 75th percentiles.
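The summary statistics plotted in Fig 3 (a median with lower and upper quartiles across six samples) can be computed as in the sketch below. The quartile interpolation method is an assumption, since the caption does not say which one was used, and the six counts are hypothetical.

```python
import statistics


def median_and_quartiles(values):
    """Return (median, lower quartile, upper quartile) for a list of numbers.

    Assumption: quartiles use the "inclusive" interpolation method of
    statistics.quantiles; other tools may give slightly different values.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, q1, q3


# Hypothetical observed-species counts for six samples at one depth.
counts = [10, 12, 14, 16, 18, 20]
median, lower, upper = median_and_quartiles(counts)
```

With six values the quartiles fall between data points, so the choice of interpolation method visibly affects the error bars.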
Fig 4. N50 values representing randomly subsampled reads at various sequencing depths after assembly by IDBA_UD.
Each point represents the median value across each of the 6 samples per technology (including 3 replicates per sample). Error bars show the 25th and 75th percentiles.
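For readers unfamiliar with the N50 statistic used in Fig 4, a minimal sketch of its standard definition follows: the contig length at which contigs of that length or longer cover at least half of the total assembly. The contig lengths are hypothetical and the function is not taken from IDBA_UD.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum past half is the N50.
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly: total length 1500, half is 750.
toy_contigs = [100, 200, 300, 400, 500]
toy_n50 = n50(toy_contigs)
```

In the toy assembly the two longest contigs (500 + 400 = 900) already exceed half of the total, so the N50 is 400.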
Fig 5. Number of species observed from randomly subsampled reads using MetaPhlAn2.
Each point represents the median value across each of the 6 samples per technology (including 3 replicates per sample). Error bars show the 25th and 75th percentiles.
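The idea behind counting species in random subsamples can be sketched as below. This is a strong simplification: it assumes every read already carries a species label, whereas a profiler such as MetaPhlAn2 works from marker genes and is not reimplemented here. The function name, seed and labels are hypothetical.

```python
import random


def observed_species_at_depth(read_labels, depth, seed=0):
    """Count distinct species in a random subsample of `depth` reads.

    read_labels: list with one species label per read (an assumption).
    Sampling is without replacement; depth must not exceed len(read_labels).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    subsample = rng.sample(read_labels, depth)
    return len(set(subsample))


# Hypothetical dataset: 100 reads from three species.
labels = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
at_full_depth = observed_species_at_depth(labels, 100)
at_depth_one = observed_species_at_depth(labels, 1)
```

Repeating the call with different seeds would give the replicates per sample that the caption mentions.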
Fig 6. Core and unique genes acquired by Metaphor with 600,000 randomly selected sequencing reads for each of the samples.
The numbers represent the total number of predicted complete or incomplete genes for each metagenome.
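The core and unique gene counts in Fig 6 correspond to set intersections and differences, as in the sketch below. Keying the gene sets by sequencing technology is an assumption made for illustration, and the gene identifiers are hypothetical.

```python
def core_and_unique(gene_sets):
    """Split predicted genes into a shared core and per-metagenome unique sets.

    gene_sets: dict mapping metagenome name -> set of gene identifiers.
    Returns (core, unique), where unique maps each name to the genes found
    in that metagenome and in no other.
    """
    core = set.intersection(*gene_sets.values())
    unique = {}
    for name, genes in gene_sets.items():
        others = set().union(
            *(g for other, g in gene_sets.items() if other != name)
        )
        unique[name] = genes - others
    return core, unique


# Hypothetical gene predictions for one sample on three platforms.
toy_genes = {
    "HiSeq": {"g1", "g2", "g3", "g4"},
    "MiSeq": {"g1", "g2", "g5"},
    "PGM": {"g1", "g3", "g6"},
}
core, unique = core_and_unique(toy_genes)
```

Genes shared by only two of the three metagenomes (g2 and g3 here) fall into neither category.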