Jon G Sanders, Sergey Nurk, Rodolfo A Salido, Jeremiah Minich, Zhenjiang Z Xu, Qiyun Zhu, Cameron Martino, Marcus Fedarko, Timothy D Arthur, Feng Chen, Brigid S Boland, Greg C Humphrey, Caitriona Brennan, Karenina Sanders, James Gaffney, Kristen Jepsen, Mahdieh Khosroheidari, Cliff Green, Marlon Liyanage, Jason W Dang, Vanessa V Phelan, Robert A Quinn, Anton Bankevich, John T Chang, Tariq M Rana, Douglas J Conrad, William J Sandborn, Larry Smarr, Pieter C Dorrestein, Pavel A Pevzner, Rob Knight.
Abstract
As metagenomic studies move to increasing numbers of samples, communities like the human gut may benefit more from the assembly of abundant microbes in many samples, rather than the exhaustive assembly of fewer samples. We term this approach leaderboard metagenome sequencing. To explore protocol optimization for leaderboard metagenomics in real samples, we introduce a benchmark of library prep and sequencing using internal references generated by synthetic long-read technology, allowing us to evaluate high-throughput library preparation methods against gold-standard reference genomes derived from the samples themselves. We introduce a low-cost protocol for high-throughput library preparation and sequencing.
Keywords: Assembly; Benchmark; Binning; Leaderboard metagenome; Long reads
Year: 2019 PMID: 31672156 PMCID: PMC6822431 DOI: 10.1186/s13059-019-1834-9
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Illustration of the benchmarking workflow using sample 1 as “primary.” Data products are represented by white ellipses and processing methods by gray rounded rectangles. The workflow consists of two parts. In the first part (TSLR reference creation), TSLR data are generated and assembled for primary sample 1. Coverage information from additional samples is used to bin the TSLR contigs into reference genome bins. In the second part (Assembly evaluation), primary sample 1 is sequenced using various short-read sequencing methods. Assemblies from these alternative methods are then compared against the internal reference to benchmark performance.
Fig. 2 a–h Genome fraction of internal reference bins recovered in test assemblies. Each panel depicts the performance of the top five reference bins from a separate sample. Reference bins are ordered from the highest to the lowest average recovered genome fraction across the library prep methods tested for that sample (x-axis categories are not comparable between panels).
Fig. 3 Assembly performance as a function of estimated genome abundance. Points represent the total fraction of a TSLR reference contig assembled as a function of average read depth for that contig, per library prep methodology. Samples e–h correspond to samples e–h in Fig. 2.
Fig. 4 Assembly metrics for miniaturized libraries prepared from three different sample sets. a N50 values for samples (points) assembled from miniaturized HyperPlus libraries (horizontal axis) and from miniaturized NexteraXT libraries (vertical axis). Point of equality is indicated by a dotted line, and values are presented for assemblies at a depth of 96 samples per lane (left panel) and at 384 samples per lane (right panel). b The total length of assemblies in contigs exceeding 5 kbp in length.
Fig. 5 Completeness and contamination statistics for bins recovered from assembly and binning of shallow-sequenced mouse metagenomes. Longitudinal samples for each mother (Mothers) or for each litter (Offspring) were coassembled. “Compositional only” bins were calculated using pooled reads from each longitudinal sample per individual, simulating low-N, high-depth sequencing. “Compositional and alignment” bins were calculated using differential coverage data obtained by mapping each longitudinal sample independently to its individual coassembly.
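
The two assembly metrics named in the Fig. 4 caption (N50, and total assembly length in contigs exceeding 5 kbp) can be illustrated with a minimal sketch. It uses the standard definition of N50 (the largest contig length L such that contigs of length at least L contain half or more of the assembled bases). The function names and the example contig lengths are hypothetical, and this is not the authors' evaluation code.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return N50: the length of the contig at which the running sum of
    contig lengths, taken longest first, reaches half of the total."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def total_length_above(lengths: Iterable[int], min_len: int = 5000) -> int:
    """Sum the lengths of contigs strictly longer than min_len (default 5 kbp)."""
    return sum(length for length in lengths if length > min_len)


# Hypothetical contig lengths (bp), for illustration only.
contigs = [8000, 6000, 4000, 2000]
assembly_n50 = n50(contigs)
length_over_5kbp = total_length_above(contigs)
```

For the example lengths, the longest contig holds 8,000 of 20,000 bases, which is less than half, and adding the second brings the running sum to 14,000, so N50 is 6,000 bp. Only the first two contigs exceed 5 kbp, giving a total of 14,000 bp.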