John Beaulaurier1,2, Shijia Zhu1,2, Gintaras Deikus1,2, Ilaria Mogno1,2,3, Xue-Song Zhang4, Austin Davis-Richardson5, Ronald Canepa5, Eric W Triplett5, Jeremiah J Faith1,2,3, Robert Sebra1,2,6, Eric E Schadt1,2,6, Gang Fang1,2.
Abstract
Shotgun metagenomics methods enable characterization of microbial communities in human microbiome and environmental samples. Assembly of metagenome sequences does not output whole genomes, so computational binning methods have been developed to cluster sequences into genome 'bins'. These methods exploit sequence composition, species abundance, or chromosome organization but cannot fully distinguish closely related species and strains. We present a binning method that incorporates bacterial DNA methylation signatures, which are detected using single-molecule real-time sequencing. Our method takes advantage of these endogenous epigenetic barcodes to resolve individual reads and assembled contigs into species- and strain-level bins. We validate our method using synthetic and real microbiome sequences. In addition to genome binning, we show that our method links plasmids and other mobile genetic elements to their host species in a real microbiome sample. Incorporation of DNA methylation information into shotgun metagenomics analyses will complement existing methods to enable more accurate sequence binning.
Year: 2017 PMID: 29227468 PMCID: PMC5762413 DOI: 10.1038/nbt.4037
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908
Figure 1. Overview of metagenomic binning using DNA methylation detected in SMRT long reads
Given a set of metagenomic shotgun SMRT sequencing reads, one can either assemble them into contigs for contig-level binning or can directly perform read-level binning without de novo assembly. A widely used approach for unsupervised binning of metagenomic contigs uses coverage (and its covariance across multiple samples) and sequence composition profiles, but these can be complemented by methylation profiles to better segregate contigs with similar sequence composition and coverage covariance, as well as to map mobile genetic elements to contigs from their host bacterium in the microbiome sample. Read-level binning by sequence composition can isolate reads from low abundance species that do not assemble into contigs, while read binning by methylation profiles can segregate reads from multiple strains for the purpose of separate, strain-specific de novo genome assemblies. These methylation and composition features can be combined with abundance features to maximize binning resolution.
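The sequence composition profiles mentioned above are typically k-mer frequency vectors. The following is a minimal sketch of how such a vector could be computed for a contig or read, assuming plain overlapping 5-mers over the alphabet ACGT. The function name `kmer_frequencies` is hypothetical, and the exact feature construction used by the authors (for example, whether reverse complements are merged) is not described in this record.

```python
from itertools import product


def kmer_frequencies(seq, k=5):
    """Return normalized k-mer frequencies (a list of length 4**k).

    Assumption: overlapping k-mers on the given strand only; windows that
    contain any symbol other than A, C, G, or T are skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:
            counts[position] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(kmers)
    return [count / total for count in counts]
```

Vectors like these can be passed to any clustering or embedding step alongside coverage or methylation features.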
Figure 2. Metagenomic binning by methylation profiles
(a) Receiver operating characteristic (ROC) curve illustrating the power to classify a contig as methylated or non-methylated regarding a specific sequence motif, as a function of the number of IPD values available for the motif sites on the contig. (b) Heatmap of contig-level methylation scores for fourteen motifs on a set of contigs from a metagenomic assembly of eight bacterial species. Contigs from each species possess distinct methylation profiles across the selected motifs. (c) t-SNE scatter plot of contig-level methylation scores across fourteen selected motifs, with manually selected bins marked by boxes. Cluster silhouette coefficients[51] were computed for the contigs from the four Bacteroides species. The coefficients (-1 indicates complete mixing, while 1 indicates complete separation) were 0.53 using methylation features and t-SNE, 0.14 using 5-mer frequency features and t-SNE (Supplementary Fig. 1a), and -0.03 using plotted coverage vs. GC-content values (Supplementary Fig. 1b). (d) Family-level annotation of 16S rRNA gene amplicon sequencing reads from an adult mouse gut microbiome by QIIME[52]. (e) t-SNE projection of metagenomic contigs assembled from SMRT reads of an adult mouse gut microbiome, organized according to differing methylation profiles across 38 sequence motifs in the sample. Labeled bins denote genome-scale assemblies with distinct methylation profiles (Table 1). (f) Coverage values for contigs (>100 kb to exclude small MGEs) in each of the nine bins identified by methylation binning.
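The silhouette coefficient quoted in panel (c) compares, for each point, its mean distance to members of its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The sketch below computes the mean coefficient over all points, assuming Euclidean distance and a score of 0 for singleton clusters. The function name `mean_silhouette` is hypothetical, and the code does not reproduce the authors' t-SNE coordinates or their reported values.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient for labeled points (Euclidean distance)."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Well-separated clusters give values near 1, while labels that cut across the true grouping give values at or below 0.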
Table 1. Genomes binned from adult mouse gut microbiome using DNA methylation profiles
Annotation of binned contigs was conducted using Kraken. For each bin, the taxonomic order to which the largest percentage of binned bases was assigned is reported. Assembly validation was done using CheckM and reflects the presence or absence of a set of single-copy marker genes. Significant motifs are those with a mean methylation score across binned contigs greater than 1.6 (28/38 motifs detected from contigs in this assembly are significant in these bins). Mapped mobile genetic elements (MGEs) are those whose methylation profiles match that of the specified methylation bin.
Column groups: binning statistics (Num., Total, Largest, Contig); annotation (Taxonomic order); bin validation (Completeness, Contamination); methylation summary (Significant motifs, Mean contig, Mapped MGEs).

| Bin | Num. | Total | Largest | Contig | Taxonomic order | Completeness (%) | Contamination (%) | Significant motifs | Mean contig | Mapped MGEs |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14 | 4027504 | 1128400 | 1089244 | | 98.68 | 2.26 | ACCG | 1.85 | 12.7 kb plasmid, 19.1 kb conjugative transposon |
| | | | | | | | | CC | 2.01 | |
| 2 | 9 | 3496584 | 2164130 | 2164130 | | 77.48 | 2.01 | CTGC | 2.43 | None found |
| 3 | 7 | 3853295 | 2087314 | 2087314 | | 99.43 | 1.13 | TC | 1.62 | None found |
| | | | | | | | | CC | 2.22 | |
| | | | | | | | | CC | 2.50 | |
| 4 | 5 | 2759439 | 2712836 | 2712836 | | 97.96 | 0.68 | G | 3.11 | None found |
| | | | | | | | | G | 2.93 | |
| 5 | 10 | 3378404 | 1873721 | 1873721 | | 97.55 | 1.76 | AGC | 1.98 | None found |
| | | | | | | | | G | 2.27 | |
| 6 | 16 | 4441324 | 1159367 | 764722 | | 98.36 | 1.26 | ATGC | 1.76 | None found |
| | | | | | | | | CC | 1.93 | |
| | | | | | | | | AAC | 2.80 | |
| 7 | 22 | 6207805 | 2165375 | 1643203 | | 98.24 | 21.52 | GGC | 2.22 | 24.7 kb plasmid, 14.7 kb plasmid, 23.2 kb conjugative transposon |
| | | | | | | | | GTG | 2.00 | |
| 8 | 14 | 3913657 | 2565370 | 2565370 | | 97.22 | 2.77 | AG | 2.21 | 14.3 kb plasmid, 15.8 kb plasmid, 21.1 kb conjugative transposon |
| | | | | | | | | AG | 1.94 | |
| | | | | | | | | G | 1.94 | |
| | | | | | | | | AG | 1.72 | |
| | | | | | | | | KAG | 2.08 | |
| | | | | | | | | TAG | 1.96 | |
| | | | | | | | | TG | 1.71 | |
| | | | | | | | | G | 1.81 | |
| 9 | 1 | 2021078 | 2021078 | 2021078 | | 99.19 | 0.00 | CGA | 2.46 | None found |
| | | | | | | | | GA | 2.18 | |
| | | | | | | | | TGM | 2.48 | |
| | | | | | | | | CG | 1.69 | |
| | | | | | | | | ACC | 2.20 | |
Figure 3. Methylation profiles can link plasmids to the chromosomal DNA of their host species
(a) Histogram of sequence-based Euclidean distance between 5-mer frequency vectors of plasmid and chromosome sequences, showing the distance between plasmids and their host chromosomes (blue; based on 2,278 bacterial plasmids and their known hosts), as well as the distance between plasmids and randomly sampled chromosomes from other species (red). (b) Heatmap showing methylation profiles for the pHel3 plasmid and its three hosts: E. coli CFT073, E. coli DH5α, and H. pylori JP26. The methylation profile of pHel3 across twenty motifs matches the host from which it was isolated. (c) Simulation analysis (1,000 iterations) using 878 SMRT sequenced bacterial genomes in the REBASE database showing the expected number of genomes with a unique 6mA methylome as a function of community size and presence of multi-strain species in the community. (d) Simulation analysis (1,000 iterations) using 155 SMRT sequenced genomes with known plasmids in the REBASE database showing the expected number of genomes with a unique 6mA methylome as a function of community size and presence of multi-strain species in the community. (e) Simulation analysis (500 iterations) using 878 SMRT sequenced genomes in the REBASE database showing the expected sequence lengths required to capture at least one instance of the methylation motifs in a genome. As expected, capturing at least one instance of some, but not all, of the methylation motifs reduces the required sequence length.
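The simulations in panels (c) and (d) can be pictured as a Monte Carlo estimate: draw a random community from a pool of genomes, count how many members carry a methylome that no other member shares, and average over iterations. The sketch below illustrates that idea only. The function name `expected_unique_methylomes`, the representation of a methylome as a set of motif strings, sampling without replacement, and the toy pool are all assumptions. It does not use REBASE data and does not model multi-strain species.

```python
import random
from collections import Counter


def expected_unique_methylomes(methylomes, community_size, iterations=1000, seed=0):
    """Estimate the expected number of community members with a unique methylome.

    methylomes: list of frozensets of motif strings, one per genome in the pool.
    Assumption: communities are drawn uniformly without replacement.
    """
    rng = random.Random(seed)
    total_unique = 0
    for _ in range(iterations):
        community = rng.sample(methylomes, community_size)
        counts = Counter(community)
        total_unique += sum(1 for m in community if counts[m] == 1)
    return total_unique / iterations


# Toy pool for illustration only (not taken from REBASE).
toy_pool = [
    frozenset({"GATC"}),
    frozenset({"GANTC"}),
    frozenset({"GATC", "CATG"}),
    frozenset({"GATC"}),  # shares its methylome with the first genome
]
estimate = expected_unique_methylomes(toy_pool, community_size=3, iterations=200)
```

With a realistic pool, plotting the estimate against `community_size` would give a curve of the kind described in the caption.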
Figure 4. Binning SMRT reads using composition and DNA methylation profiles
(a) 5-mer frequency-based binning of assembled contigs and raw reads (length >15 kb) from the HMP mock community, where only the unassembled reads are labeled. Reads from the low-abundance species R. sphaeroides form a distinct cluster near the coordinates (-8,-22). (b) The 2D histogram of contigs and unassembled reads, corresponding to (a); this 2D histogram lacks labeling but nevertheless includes many highly species-specific subpopulations. (c) Combined assembly of a synthetic mixture of reads from H. pylori strains J99 and 26695 results in one small contig containing mostly reads from strain 26695 and one large, highly chimeric contig. (d) Read-level methylation profiles for unassembled reads from the synthetic mixture are separated by principal component analysis (PCA) into discrete, strain-specific clusters. (e) Separate assembly of reads that were segregated using methylation profiles results in large, highly strain-specific contigs. (f) Combined assembly of a synthetic mixture of reads from E. coli strains BAA-2196 O26:H11, BAA-2215 O103:H11, and BAA-2440 O111 results in many chimeric contigs that contain reads from all three strains. (g) Reads from the synthetic mixture were aligned to the E. coli K12 MG1655 reference in order to correct sequencing errors, and the read-level methylation profiles were separated by PCA into strain-specific clusters. (h) Separate assembly of reads segregated by methylation profiles as demonstrated in (g) results in a dramatic reduction of chimerism in the assembled reads.
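Panels (d) and (g) rely on PCA to separate read-level methylation profiles. As a rough illustration, the sketch below projects a small matrix of per-read motif scores onto its leading principal components using power iteration with deflation. The function name `pca_project` and the toy score matrix are hypothetical, and a library routine (for example an SVD-based PCA) would normally be preferred for real data.

```python
import math


def _unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def pca_project(rows, n_components=2, n_iter=200):
    """Project rows (reads x motif scores) onto the top principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the motif-score features (d x d).
    cov = [[sum(x[a] * x[b] for x in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    components = []
    for _component in range(n_components):
        # Slightly asymmetric start vector so it is unlikely to be
        # orthogonal to the leading eigenvector.
        v = _unit([1.0 + 0.01 * j for j in range(d)])
        for _step in range(n_iter):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            if math.sqrt(sum(x * x for x in w)) < 1e-12:
                break  # no variance left in the deflated matrix
            v = _unit(w)
        eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                         for a in range(d))
        # Deflate: remove the variance explained by this component.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
        components.append(v)
    return [[sum(x[j] * comp[j] for j in range(d)) for comp in components]
            for x in centered]


# Toy scores for six reads over three motifs (illustrative values only):
# the first three reads resemble one strain, the last three another.
toy_scores = [
    [2.0, 0.1, 0.0], [2.1, 0.0, 0.1], [1.9, 0.2, 0.0],
    [0.1, 2.0, 0.0], [0.0, 2.1, 0.1], [0.2, 1.9, 0.0],
]
projected = pca_project(toy_scores)
```

In this toy case the first component alone places the two groups of reads on opposite sides of zero, which is the kind of separation that allows reads to be split before strain-specific assembly.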