Derek M Bickhart, Mick Watson, Sergey Koren, Kevin Panke-Buisse, Laura M Cersosimo, Maximilian O Press, Curtis P Van Tassell, Jo Ann S Van Kessel, Bradd J Haley, Seon Woo Kim, Cheryl Heiner, Garret Suen, Kiranmayee Bakshy, Ivan Liachko, Shawn T Sullivan, Phillip R Myer, Jay Ghurye, Mihai Pop, Paul J Weimer, Adam M Phillippy, Timothy P L Smith.
Abstract
We describe a method that adds long-read sequencing to a mix of technologies used to assemble a highly complex cattle rumen microbial community, and provide a comparison to short read-based methods. Long-read alignments and Hi-C linkage between contigs support the identification of 188 novel virus-host associations and the determination of phage life cycle states in the rumen microbial community. The long-read assembly also identifies 94 antimicrobial resistance genes, compared to only seven alleles in the short-read assembly. We demonstrate novel techniques that work synergistically to improve characterization of biological features in a highly complex rumen microbial community.
Keywords: Hi-C; Metagenome assembly; Metagenomics; PacBio; Virus-host association
Year: 2019 PMID: 31375138 PMCID: PMC6676630 DOI: 10.1186/s13059-019-1760-x
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Assembly workflow and sampling bias estimates show GC% discrepancies in long-read vs short-read assemblies. Using the same sample from a cannulated cow, (a) we extracted DNA using a modified bead beating protocol that still preserved a large proportion of high molecular weight DNA strands. This DNA extraction was sequenced on a short-read sequencer (Illumina; dark green) and a long-read sequencer (PacBio RSII and Sequel; dark orange), with each sequence source assembled separately. Assessments of read- and contig-level GC% bias (b) revealed that a substantial proportion of sampled low GC DNA was not incorporated into either assembly. (c) Assembly contigs were annotated for likely superkingdoms of origin and were compared for overall contig lengths. The long-read assembly tended to have longer average contigs for each assembled superkingdom compared to the short-read assembly.
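The read- and contig-level GC% comparison in panel b rests on a per-sequence GC calculation. The sketch below shows one common way to compute it. The function name `gc_percent` and the choice to ignore ambiguous bases such as N are assumptions made for illustration, not details taken from the study.

```python
def gc_percent(sequence):
    """Return GC content as a percentage of unambiguous bases (A, C, G, T).

    Assumption: ambiguous bases (e.g. N) are excluded from the denominator.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        return 0.0  # no unambiguous bases to measure
    return 100.0 * gc / total


# Hypothetical toy sequences, not data from the study
print(gc_percent("ATGC"))      # 50.0
print(gc_percent("AATTNNGC"))  # 2 of 6 unambiguous bases are G or C
```

Applying the same function to reads and to assembled contigs gives two GC% distributions that can be compared to look for sequence that was sampled but not assembled.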
Assembly statistics
| Assembly | Contigs | Total assembly length (bp) | Contig N100K¹ |
|---|---|---|---|
| Illumina | 2,182,263 | 5,111,042,186 | 88 |
| PacBio | 77,670 | 1,076,426,244 | 384 |

¹ The contig N100K is defined as the total number of contigs in the entire assembly that are greater than 100 kbp in length.
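The N100K definition in the footnote can be expressed as a simple count over contig lengths. This is a minimal sketch: the function name `contig_n100k` and the toy lengths are hypothetical, and "greater than" is read as a strict inequality.

```python
def contig_n100k(contig_lengths, threshold_bp=100_000):
    """Count contigs strictly longer than threshold_bp (100 kbp by default)."""
    return sum(1 for length in contig_lengths if length > threshold_bp)


# Hypothetical toy contig lengths in bp, not data from the study
lengths = [250_000, 100_000, 99_999, 100_001, 1_500]
print(contig_n100k(lengths))  # 2
```

Because the metric is a raw count rather than a fraction, it can favor an assembly with far fewer contigs overall, as in the table above.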
Fig. 2 Identification of high-quality bins in comparative assemblies highlights the need for dereplication of different binning methods. a Binning performed by Metabat (light blue) and Proximeta Hi-C binning (Hi-C; blue) revealed that the long-read assembly consistently had fewer, longer contigs per bin than the short-read assembly. b Bin set division into medium-quality draft (MQ) and high-quality draft (HQ) bins was based on DAS_Tool single-copy gene (SCG) redundancy and completeness. Assessment of SCG completeness and redundancy revealed 10 and 42 high-quality bins in the long-read (c) and short-read (d) assemblies, respectively. The Proximeta Hi-C binning method performed better in terms of SCG metrics in the long-read assembly. e Plots of all identified bins in the long-read (triangle) and short-read (circle) assemblies revealed a wide range of chimeric bins containing high SCG redundancy. Bins highlighted in the blue rectangle correspond to the MQ bins identified by the DAS_Tool algorithm, while the red rectangle corresponds to the HQ bin set.
Assembly bin taxonomic assignment and gene content
Columns Archaea through No-hits give the assembled sequence taxonomic affiliation (kbp)¹.

| Assembly | Bin set | Avg # complete ORFs per contig² | Archaea | Bacteria | Eukaryota | Viruses | No-hits |
|---|---|---|---|---|---|---|---|
| Illumina | Unbinned | 1.31 | 46,405 | 3,419,539 | 125,885 | 6,287 | 1,019,041 |
| Illumina | MQ | 3.39 | 4,543 | 393,630 | 3,906 | 71 | 14,113 |
| Illumina | HQ | 7.66 | 1,056 | 75,467 | 575 | 4 | 523 |
| PacBio | Unbinned | 14.6 | 10,686 | 854,468 | 7,707 | 2,290 | 26,804 |
| PacBio | MQ | 20.8 | 885 | 149,168 | 811 | 50 | 501 |
| PacBio | HQ | 48.2 | 1,809 | 20,711 | 512 | 0 | 17 |

¹ Superkingdom taxonomic affiliation was based on contig-level assignments derived from the BlobTools/DIAMOND workflow.

² Complete ORFs were defined as Prodigal predictions that had a “partial” status of “00,” which indicates the presence of a start and stop codon for the ORF.
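The complete-ORF criterion can be illustrated by parsing Prodigal FASTA headers, which carry a `partial=` field and name each gene as the contig ID plus an underscore and a gene index. The sketch below is illustrative only: the function names and toy headers are hypothetical, and averaging over contigs that have at least one predicted ORF is an assumption, since the source does not state the denominator it used.

```python
import re

PARTIAL_RE = re.compile(r"partial=(\d\d)")


def complete_orfs_per_contig(header_lines):
    """Map each contig to its number of ORFs flagged partial=00."""
    counts = {}
    for line in header_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        gene_id = line[1:].split()[0]
        contig = gene_id.rsplit("_", 1)[0]  # strip the trailing gene index
        counts.setdefault(contig, 0)
        match = PARTIAL_RE.search(line)
        if match and match.group(1) == "00":
            counts[contig] += 1
    return counts


def average_complete_orfs(header_lines):
    """Average complete ORFs over contigs with at least one prediction."""
    counts = complete_orfs_per_contig(header_lines)
    return sum(counts.values()) / len(counts) if counts else 0.0


# Hypothetical toy headers in Prodigal's layout, not data from the study
headers = [
    ">ctgA_1 # 1 # 300 # 1 # ID=1_1;partial=10;start_type=Edge",
    ">ctgA_2 # 350 # 900 # 1 # ID=1_2;partial=00;start_type=ATG",
    ">ctgA_3 # 950 # 1500 # -1 # ID=1_3;partial=00;start_type=ATG",
    ">ctgB_1 # 10 # 400 # 1 # ID=2_1;partial=01;start_type=ATG",
]
print(complete_orfs_per_contig(headers))  # {'ctgA': 2, 'ctgB': 0}
print(average_complete_orfs(headers))     # 1.0
```

ORFs that run off a contig edge are flagged as partial, so longer contigs tend to score higher on this metric, which is consistent with the gap between the PacBio and Illumina rows.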
Fig. 3 Dataset novelty compared to other rumen metagenome assemblies. Chord diagrams showing the contig alignment overlap (by base pair) of the short-read (a) and long-read (b) contigs to the Hungate1000 and Stewart et al. [18] rumen microbial assemblies. The “Both” category consists of alignments of the short-read and long-read contigs that have alignments to both Stewart et al. [18] and the Hungate1000 datasets. c A dendrogram comparison of dataset sampling completeness compared to 16S V4 amplicon sequence data analysis. The outer rings of the dendrogram indicate the presence (blue) or absence (red) of the particular phylotype in each dataset. Datasets are represented in the following order (from the outer edge to the internal edge): (1) the short-read assembly contigs, (2) the long-read assembly contigs, and (3) 16S V4 amplicon sequence data. The internal dendrogram represents each phylum in a different color (see legend), with individual tiers corresponding to the different levels of taxonomic affiliation. The outermost edge of the dendrogram consists of the genus-level affiliation.
Fig. 4 Network analysis of long-read alignments and Hi-C intercontig links identifies hosts for assembled viral contigs. To identify putative hosts for viral contigs, PacBio read alignments (light blue edges) and Hi-C intercontig link alignments (dark blue edges) were counted between viral contigs (hexagons) and non-viral contigs (circles) in the long-read assembly (a) and the short-read assembly (b). Instances where both PacBio reads and Hi-C intercontig links supported a virus-host assignment are also labeled (red edges). The long-read assembly enabled the detection of more virus-host associations, in addition to several cases where viral contigs may display cross-species infectivity. We identified several viral contigs that infect important species in the rumen, including those from the genus Sutterella, and several species that metabolize sulfur. In addition, we identified a candidate viral association with a novel genus of rumen microbes identified in this study.
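The edge-counting logic described in this caption can be sketched as follows. This is not the authors' pipeline: the function name, the evidence labels, and the toy contig names are hypothetical, and no minimum-support threshold is applied because the caption does not give one.

```python
from collections import Counter


def classify_virus_host_edges(read_pairs, hic_pairs):
    """Count support per (viral_contig, host_contig) pair and label its evidence.

    read_pairs: one tuple per long-read alignment spanning both contigs.
    hic_pairs:  one tuple per Hi-C intercontig link between both contigs.
    Returns {pair: (evidence, n_reads, n_hic_links)}.
    """
    read_counts = Counter(read_pairs)
    hic_counts = Counter(hic_pairs)
    edges = {}
    for pair in sorted(set(read_counts) | set(hic_counts)):
        n_reads = read_counts.get(pair, 0)
        n_hic = hic_counts.get(pair, 0)
        if n_reads and n_hic:
            evidence = "both"       # red edges in the figure
        elif n_reads:
            evidence = "long_read"  # light blue edges
        else:
            evidence = "hic"        # dark blue edges
        edges[pair] = (evidence, n_reads, n_hic)
    return edges


# Hypothetical toy observations, not data from the study
reads = [("virus1", "hostA"), ("virus1", "hostA"), ("virus2", "hostB")]
links = [("virus1", "hostA"), ("virus3", "hostC")]
for pair, info in classify_virus_host_edges(reads, links).items():
    print(pair, info)
```

A viral contig that appears in pairs with contigs from more than one taxon would be a candidate for the cross-species infectivity mentioned in the caption.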
Fig. 5 CRISPR array identification and ARG allele class counts were influenced by assembly quality. a The long-read assembly (dark orange) contigs had fewer identified CRISPR arrays than the short-read contigs (dark green); however, the CRISPR arrays with the largest count of spacers were overrepresented in the long-read assembly. b The long-read assembly had 13-fold more antimicrobial resistance gene (ARG) alleles than the short-read assembly despite having 5-fold less sequence data coverage. The macrolide, lincosamide, and tetracycline ARG classes were particularly enriched in the long-read assembly compared to alleles identified in the short-read assembly.