Volodymyr Kuleshov, Chao Jiang, Wenyu Zhou, Fereshteh Jahanbani, Serafim Batzoglou, Michael Snyder.
Abstract
Identifying bacterial strains in metagenome and microbiome samples using computational analyses of short-read sequences remains a difficult problem. Here, we present an analysis of a human gut microbiome using TruSeq synthetic long reads combined with computational tools for metagenomic long-read assembly, variant calling and haplotyping (Nanoscope and Lens). Our analysis identifies 178 bacterial species, of which 51 were not found using shotgun reads alone. We recover bacterial contigs that comprise multiple operons, including 22 contigs of >1 Mbp. Furthermore, we observe extensive intraspecies variation within microbial strains in the form of haplotypes that span up to hundreds of Kbp. Incorporation of synthetic long-read sequencing technology with standard short-read approaches enables more precise and comprehensive analyses of metagenomic samples.
Year: 2015 PMID: 26655498 PMCID: PMC4884093 DOI: 10.1038/nbt.3416
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908
Figure 1. The Nanoscope pipeline and the Lens algorithm. Left: Nanoscope first assembles short and long reads using the SOAPdenovo2 and Celera assemblers and merges the results with Minimus2; it then assigns taxonomic labels to contigs with the Fragment Classification Package (FCP) and identifies bacterial strains with Lens; finally, it estimates abundances of detected bacterial species by mapping short reads to contigs and by aggregating the coverage over all contigs assigned to the same species. Right: The Lens algorithm identifies heterozygous variants in the assembled genomic contigs (a); these variants are supported by long reads (b) aligned to the contigs. Each long read originates from a single organism; thus the variants it supports must belong to the same substrain. By connecting reads at their overlapping variants, Lens places the variants into multi-kilobase-long haplotypes (c) associated with bacterial strains. The number of haplotypes is a priori unknown and is inferred from the data.
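The final abundance step described in the caption can be sketched as follows. This is a minimal illustration, not Nanoscope's code: the function name `species_abundance`, the input layout (species label, contig length, and mapped short-read bases per contig), and the choice of length-weighted mean coverage normalized to sum to 1 are all assumptions made for the example.

```python
from collections import defaultdict


def species_abundance(contigs):
    """Estimate relative species abundance from per-contig short-read coverage.

    contigs: iterable of (species, contig_length_bp, mapped_bases) tuples.
    This input layout is hypothetical; it is not a Nanoscope file format.
    Returns {species: relative abundance}, where abundances sum to 1.
    """
    mapped = defaultdict(int)   # total short-read bases mapped per species
    length = defaultdict(int)   # total contig length per species

    for species, contig_len, bases in contigs:
        mapped[species] += bases
        length[species] += contig_len

    # Length-weighted mean coverage over all contigs assigned to a species.
    coverage = {s: mapped[s] / length[s] for s in mapped if length[s] > 0}

    total = sum(coverage.values())
    if total == 0:
        return {}
    return {s: c / total for s, c in coverage.items()}
```

Pooling mapped bases and contig lengths before dividing keeps short contigs from dominating the estimate, which is one reasonable way to aggregate coverage across contigs of very different sizes.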
Assembly of the human gut metagenome. Short and long read libraries were assembled with the SOAPdenovo2 and Celera assemblers, respectively. The results were merged using Minimus2 to produce a joint assembly. Long reads assemble into significantly longer contigs that contain about twice as many genes.
| | Short | Long | Joint |
|---|---|---|---|
| Number of contigs | 92,247 | 24,199 | 34,786 |
| Largest contig (Mbp) | 0.63 | 3.94 | 3.94 |
| Total length (Mbp) | 233 | 610 | 656 |
| N50 (Kbp) | 8.7 | 37.3 | 49.2 |
| Number of predicted genes | 274,600 | 523,358 | 552,680 |
| Average number of genes/contig | 2.98 | 21.62 | 15.88 |
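For readers unfamiliar with the N50 statistic reported in the table, the sketch below shows the standard definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name `n50` is chosen for this example, and the contig lengths in the usage note are made up, not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together cover at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")
```

For example, for toy contig lengths `[80, 70, 50, 40, 30, 20]` (total 290), the two longest contigs already cover 150 bases, so `n50` returns 70.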
Overview of the variation among bacterial strains identified by the Lens algorithm. The human gut contains hundreds of thousands of variants, which are distributed across 2,204 genomic regions of up to 112 kbp in length. A region is defined as a maximal set of variants that can be phased by Lens using long reads.
| Statistic | Value |
|---|---|
| Genomic variants | 202,574 |
| Genomic regions harboring haplotypes | 2,204 |
| Number of haplotypes | 5,024 |
| N50 region length (bp) | 18,985 |
| Max region length (bp) | 112,271 |
| Fraction of regions intersecting a gene | 95% |
| Fraction of genes intersecting a region | 4.4% |
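The definition of a region as a maximal set of variants that long reads can phase together corresponds to a connected component: two variants are linked if some long read covers both. The sketch below illustrates that idea with a union-find structure. It is an assumption-laden illustration rather than the Lens implementation; the function name `phasing_regions` and the input format (one list of variant positions per read) are hypothetical.

```python
def phasing_regions(read_variants):
    """Group variant positions into regions connected by long reads.

    read_variants: list of iterables of variant positions (ints), one per
    long read. This input format is hypothetical, not the Lens format.
    Returns a sorted list of regions, each a sorted list of positions.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for variants in read_variants:
        variants = list(variants)
        for v in variants:
            find(v)  # register every variant, even on single-variant reads
        # All variants on one read come from one organism, so link them.
        for v in variants[1:]:
            union(variants[0], v)

    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), []).append(v)
    return sorted(sorted(g) for g in groups.values())
```

With reads covering positions `[100, 250]`, `[250, 900]` and `[5000, 5200]`, the first two reads share the variant at 250 and merge into one region, while the third forms a region of its own. The length of a region would then be the distance between its first and last variant.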
Figure 2. Long reads aligned to assembled metagenomic contigs reveal extensive variation among bacterial strains. Top: Fragment of a 110 kbp long region within a metagenomic contig belonging to the species Odoribacter splanchnicus; the region harbors numerous strain variants that can be assembled into bacterial haplotypes. Bottom left: Fragment of a bacterial region containing 32 genomic variants that assemble into four bacterial haplotypes. Bottom right: These haplotypes can be placed in an evolutionary tree satisfying perfect phylogeny; for simplicity, we visualize this tree over 4 of the 32 positions in the region (upper left corner).
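Whether a set of haplotypes admits a perfect phylogeny can be checked with the classical four-gamete test: for biallelic sites, such a tree exists exactly when no pair of sites shows all four allele combinations (00, 01, 10, 11). The sketch below assumes haplotypes encoded as equal-length strings of '0' and '1'; this encoding and the function name are choices made for the example, and the source does not say how the authors performed the check.

```python
from itertools import combinations


def satisfies_perfect_phylogeny(haplotypes):
    """Four-gamete test on biallelic haplotypes.

    haplotypes: list of equal-length strings over '0' and '1'
    (an encoding assumed for this example).
    Returns True if no pair of sites exhibits all four gametes.
    """
    if not haplotypes:
        return True
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("all haplotypes must have the same length")

    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            # Sites i and j cannot both have arisen once on a single tree.
            return False
    return True
```

For instance, the four toy haplotypes `0000`, `1000`, `1100` and `0011` pass the test, whereas `00`, `01`, `10`, `11` fail it because the single pair of sites shows all four combinations.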
Figure 3. Bacterial strains identified only by long reads (blue), only by short reads (magenta), by both technologies (green), and only by a combination of the two (black), ordered by abundance. Long reads identify 51 species that short reads do not detect; combining short and long reads identifies 58 additional species, including ones having the lowest abundance. A total of 178 species are detected using all the methods.
Comparison to alternative technologies. Our results are similar to those of alternative techniques that used hundreds of pooled samples (Nielsen et al.) or potentially inaccurate binning approaches (Albertsen et al., Iverson et al.). We also analyze variation at the resolution of individual variants and haplotypes rather than whole strains or species.
| | Our method | Nielsen et al. | Albertsen et al. | Iverson et al. | Sharon et al. |
|---|---|---|---|---|---|
| Sample type | Gut microbiome | Gut microbiome | Environmental | Environmental | Environmental |
| # of samples | 1 | 18–396 | 2 | 1 | 3 independent |
| Seq. platform | TruSeq SLR | Illumina WGS | Illumina WGS | SOLiD mate-pairs | TruSeq SLR |
| Seq. amount | 8 Gbp (long reads) ×40 (subassembly) | 4.5 Gbp/sample | 86 Gbp | 59 Gbp | 1.5 Gbp (long reads) ×40 (subassembly) |
| Analysis type | De-novo assembly | Correlation across multiple samples | DNA extraction efficiency binning | Tetranucleotide binning | De-novo assembly |
| Resolution | Individual SNV | Strain | Species with diff. GC content | Family | Strain |
| Longest scaffold | 3.9 Mbp | 733 Kbp | 3.6 Mbp | 2.2 Mbp | <20 Kbp |
| Scaffold N50 | 49 Kbp | 39 Kbp | 4.1 Kbp overall | 6.8 Kbp | 8.2 Kbp |
| Bases assembled | 656 Mbp | 45 Mbp (genes) | 423 Mbp | 300 Mbp | 500 Mbp/sample |
| # Variants | 200K | n/a | n/a | n/a | n/a |
| # Haplotypes | 5K | n/a | n/a | n/a | n/a |