Chao Fang, Huanzi Zhong, Yuxiang Lin, Bing Chen, Mo Han, Huahui Ren, Haorong Lu, Jacob M Luber, Min Xia, Wangsheng Li, Shayna Stein, Xun Xu, Wenwei Zhang, Radoje Drmanac, Jian Wang, Huanming Yang, Lennart Hammarström, Aleksandar D Kostic, Karsten Kristiansen, Junhua Li.
Abstract
Background: More extensive use of metagenomic shotgun sequencing in microbiome research relies on the development of high-throughput, cost-effective sequencing. Here we present a comprehensive evaluation of the performance of the new high-throughput sequencing platform BGISEQ-500 for metagenomic shotgun sequencing and compare its performance with that of 2 Illumina platforms.
Findings: Using fecal samples from 20 healthy individuals, we evaluated the intra-platform reproducibility for metagenomic sequencing on the BGISEQ-500 platform in a setup comprising 8 library replicates and 8 sequencing replicates. Cross-platform consistency was evaluated by comparing 20 pairwise replicates on the BGISEQ-500 platform vs the Illumina HiSeq 2000 platform and the Illumina HiSeq 4000 platform. In addition, we compared the performance of the 2 Illumina platforms against each other. By a newly developed overall accuracy quality control method, an average of 82.45 million high-quality reads (96.06% of raw reads) per sample, with 90.56% of bases scoring Q30 and above, was obtained using the BGISEQ-500 platform. Quantitative analyses revealed extremely high reproducibility between BGISEQ-500 intra-platform replicates. Cross-platform replicates differed slightly more than intra-platform replicates, yet a high consistency was observed. Only a low percentage (2.02%-3.25%) of genes exhibited significant differences in relative abundance comparing the BGISEQ-500 and HiSeq platforms, with a bias toward genes with higher GC content being enriched on the HiSeq platforms.
Conclusions: Our study provides the first set of performance metrics for human gut metagenomic sequencing data using BGISEQ-500. The high accuracy and technical reproducibility confirm the applicability of the new platform for metagenomic studies, though caution is still warranted when combining metagenomic data from different platforms.
Year: 2018 PMID: 29293960 PMCID: PMC5848809 DOI: 10.1093/gigascience/gix133
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Schematic model summarizing the study design and analysis strategy. Schematic diagram depicting the process of data generation, including collection of fecal samples and extraction of DNA from 20 healthy subjects, library preparation, and sequencing strategy for BGISEQ-500 and HiSeq 2000. Each circle indicates 1 independent subject, with subject ID shown in the circle. For BGISEQ-500, each sample was sheared and tagged with a unique barcode to prepare libraries, then equal amounts of DNA fragments from 8 samples were pooled together for DNB formation, loading, and sequencing. In total, 20 samples were sequenced in 3 lanes (F0, G0, and H0). Of them, DNA from 8 subjects (S01-S08) was utilized to perform library construction and sequencing twice; the corresponding 8 paired datasets from lane I0 (green) and lane F0 (blue) were considered library replicates. DNBs from the same 8 subjects were loaded and sequenced twice to generate 8 paired sequencing replicates (lane F0 and lane F1). Twenty datasets from HiSeq 2000 were also generated in this study. The detailed assessment and comparison analyses of metagenomic datasets between intra- and inter-platforms are shown below.
Figure 2: Evaluation of intra-platform reproducibility. A, Detecting mapped read count fluctuations of genes between intra-platform replicates. Unique IGC mapped reads were downsized to 20 million for each subject, and the read count fluctuations were estimated (Supplementary Methods). The x-axis represents mapped read counts of a gene in replicate 1 (F0), and the y-axis represents mapped read counts of that gene in replicate 2 (F1 as sequencing replicate and I0 as library replicate). The area bordered by the red line represents the 99% confidence interval (CI) of genes showing the expected read count fluctuations in their replicates. The dashed line indicates that, at 99% CI, genes with greater than or equal to 10 reads in replicate 1 (x-axis) could be detected (with mapped reads greater than or equal to 1) in replicate 2 (y-axis). B, Spearman's correlation coefficient. Genes with greater than or equal to 10 mapped reads per sample were retained as highly reproducible genes and used for Spearman correlation analysis. Both library and sequencing replicates showed very high correlations at the gene level (0.930 and 0.926) and species level (0.984 and 0.989).
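The filtering and correlation step described for Figure 2B can be sketched roughly as follows. This is an illustrative sketch, not the authors' pipeline: the function names, the dictionary input format (gene ID to mapped read count), and the choice to require at least 10 reads in both replicates are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_between_replicates(
    rep1: Dict[str, int], rep2: Dict[str, int], min_reads: int = 10
) -> float:
    """Spearman's rho over genes with >= min_reads in both replicates.

    Assumption: a gene is kept only if it passes the threshold in both
    replicates; the caption does not spell out this detail.
    """
    genes = [
        g for g in rep1
        if rep1[g] >= min_reads and rep2.get(g, 0) >= min_reads
    ]
    if len(genes) < 2:
        raise ValueError("need at least 2 genes passing the read threshold")
    ranks1 = average_ranks([rep1[g] for g in genes])
    ranks2 = average_ranks([rep2[g] for g in genes])
    # Spearman's rho is the Pearson correlation of the ranks.
    return pearson(ranks1, ranks2)
```

Computing rho as the Pearson correlation of average ranks handles tied read counts, which are common among low-abundance genes.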
Figure 3: Evaluation of inter-platform consistency. For 19 cross-platform replicates at 99% CI, 91.89% of genes in the BGISEQ-500 datasets showed the expected mapped read count fluctuations using HiSeq 2000 (A). The Spearman correlation analyses revealed high agreement within 19 pairs of platform replicates between BGISEQ-500 and HiSeq 2000 (B) (an average Spearman's rho of 0.724 at gene level [top] and 0.948 at species level [bottom]) and between BGISEQ-500 and HiSeq 4000 (C) (an average Spearman's rho of 0.859 at gene level [top] and 0.965 at species level [bottom]).
Figure 4: A, GC content distributions of genes that differed significantly in abundance between platforms. Density curves showing a comparison of GC content distributions of the total 9.9 million IGC genes (blue), all 349 479 highly reproducible (HR) genes (green), and all 11 350 genes that differed significantly in abundance between the 2 platforms (red line). B, Two-dimensional plot showing the GC content distribution of genes that differed significantly in abundance between the BGISEQ-500 and HiSeq 2000 platforms. The x-axis indicates the GC content of genes; the y-axis indicates the fold-change of gene relative abundance (RA), calculated as the log10-transformed ratio of mean RA in the HiSeq 2000 datasets to mean RA in the BGISEQ-500 datasets. C, D, Density histograms showing the coefficients of a robust linear model for relative abundance of genes from the top 20 species and their GC content for genes that differed significantly in abundance between the 2 platforms (C) and for all HR genes (D). E, F, Density curves (E) and 2-dimensional plot (F) showing the GC content distributions of HR genes that differed significantly in abundance between the BGISEQ-500 and HiSeq 4000 platforms.
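The two quantities plotted in Figure 4B, GC content and the log10 fold-change of mean relative abundance, could be computed along these lines. This is a minimal sketch under stated assumptions: the function names and the input layout (one gene-to-RA dictionary per sample) are hypothetical, genes absent from a sample are treated as having RA 0, and genes with a zero mean on either platform are skipped because the log ratio is undefined for them.

```python
import math
from typing import Dict, List


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def mean_relative_abundance(samples: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each gene's RA across samples (assumption: missing gene = 0)."""
    genes = set().union(*samples)
    return {g: sum(s.get(g, 0.0) for s in samples) / len(samples) for g in genes}


def log10_fold_change(
    hiseq_samples: List[Dict[str, float]],
    bgiseq_samples: List[Dict[str, float]],
) -> Dict[str, float]:
    """log10(mean RA on HiSeq / mean RA on BGISEQ-500) per gene.

    Positive values mean the gene is relatively enriched on HiSeq.
    Genes with a zero mean on either platform are omitted (assumption).
    """
    hiseq_mean = mean_relative_abundance(hiseq_samples)
    bgiseq_mean = mean_relative_abundance(bgiseq_samples)
    result = {}
    for gene, hiseq_ra in hiseq_mean.items():
        bgiseq_ra = bgiseq_mean.get(gene, 0.0)
        if hiseq_ra > 0 and bgiseq_ra > 0:
            result[gene] = math.log10(hiseq_ra / bgiseq_ra)
    return result
```

With this sign convention, a gene whose mean RA is ten times higher on HiSeq than on BGISEQ-500 has a fold-change of about 1, and a gene with equal means has a fold-change of 0.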
Figure 5: Comparison of relative species abundance between BGISEQ-500 and HiSeq 2000. Averaged microbial abundance calculated with MetaPhlAn2 across BGI replicates plotted against microbial abundance for the corresponding Illumina replicates for all samples. Species are colored by GC content.