Anestis Gkanogiannis, Stéphane Gazut, Marcel Salanoubat, Sawsan Kanj, Thomas Brüls.
Abstract
BACKGROUND: Metagenomics holds great promise for deepening our knowledge of key bacteria-driven processes, but metagenome assembly remains problematic, typically resulting in representation biases and discarding significant amounts of non-redundant sequence information. To alleviate the constraints that assembly can impose on downstream analyses, and/or to increase the fraction of raw reads assembled via targeted assemblies relying on pre-assembly binning steps, we developed a set of binning modules and evaluated their combination in a new "assembly-free" binning protocol.
Keywords: Binning; Environmental genomics; Metagenomics; Microbiome; Sequence clustering; Unsupervised learning
Year: 2016 PMID: 27542753 PMCID: PMC4992282 DOI: 10.1186/s12859-016-1186-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Coverage biases in metagenome assemblies
| Abundance class (Bin #) | Bin abundance level | % Binned reads in assembly | Estimated number of genomes in bin |
|---|---|---|---|
| Bin I | 320 | 2.8 % | ≥ 3 |
| Bin II | 180 | 2.9 % | ≥ 1 |
| Bin III | 90 | 12.6 % | ≥ 2 |
| Bin IV | 30 | 54.0 % | ≥ 6 |
| Bin V | 9 | 0.5 % | ≥ 3 |
Unassembled (raw) reads derived from a xenobiotic-degrading bacterial consortium (Chaussonerie et al. 2016, under review) were segregated by the AB-Cl module (k-mer size = 25) into 5 abundance classes (bins). Mapping the reads from individual bins onto the metagenome assembly built from all the raw reads reveals a significant under-representation of abundance classes I, II, III and V.
Fig. 2 Abundance distributions of synthetic bacterial communities. Abundance distribution of sampled species for various datasets of increasing complexity, ranging from 5 to 700 distinct genomes (see main text and Methods). A difference of four orders of magnitude between the number of cells from the most abundant and the least abundant organisms is used to determine the power law parameters for each dataset (see Methods).
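As a rough illustration of how such an abundance distribution could be parameterized, the sketch below fixes the exponent of a rank-based power law so that the most abundant genome is four orders of magnitude more abundant than the least abundant one. The functional form `abundance = top_abundance * rank ** (-exponent)` and the helper name `power_law_abundances` are assumptions made for this example; the authors' actual procedure is described in their Methods and may differ.

```python
import math

def power_law_abundances(n_genomes, orders_of_magnitude=4.0, top_abundance=1.0):
    """Return rank-ordered abundances following an assumed rank power law.

    abundance(rank) = top_abundance * rank ** (-exponent), with the exponent
    chosen so that abundance(1) / abundance(n_genomes) equals
    10 ** orders_of_magnitude.
    """
    if n_genomes < 2:
        raise ValueError("need at least two genomes to define an abundance range")
    exponent = orders_of_magnitude / math.log10(n_genomes)
    return [top_abundance * rank ** (-exponent) for rank in range(1, n_genomes + 1)]

# Hypothetical usage: a 700-genome community spanning four orders of magnitude.
abundances = power_law_abundances(700)
ratio = abundances[0] / abundances[-1]
```

With this parameterization the exponent depends only on the number of genomes and the requested dynamic range, so smaller communities get a steeper curve.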
Fig. 3 Heat map showing sampling levels for the 700 distinct genomes (rows) in each of the 50 samples (columns). Abundance levels of bacterial genomes across the 50 microbiome samples used to investigate the biomarker discovery use case.
Fig. 1 Frequency histogram of long k-mers computed from a simulated community genome. The histogram was generated by counting 20-mers from the synthetic dataset assembled from 500 distinct bacterial genomes, using 150 bp read length and an abundance distribution spanning 4 orders of magnitude (see Methods). k-mer frequencies are shown on the x axis; the number of distinct k-mers with a given frequency is shown on the y axis.
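The kind of histogram described in this caption can be sketched in a few lines: count every k-mer in the reads, then count how many distinct k-mers share each frequency. This is a minimal illustration on toy reads, not the authors' implementation; it ignores reverse complements and sequencing errors, and the function names are hypothetical.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count occurrences of every k-mer across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def frequency_histogram(counts):
    """Map each k-mer frequency (x axis) to the number of distinct k-mers with it (y axis)."""
    return Counter(counts.values())

# Toy example with k = 4; the caption uses k = 20 on 150 bp reads.
reads = ["ACGTAC", "CGTACG", "TTTTTT"]
counts = kmer_counts(reads, k=4)
histogram = frequency_histogram(counts)
```

Here two 4-mers occur once, two occur twice and one occurs three times, so `histogram` maps 1 to 2, 2 to 2 and 3 to 1.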
Coverage-based binning can enhance the recovery of individual genomes from metagenomes
| Abundance class (Bin #) | % Binned reads in global assembly | % Binned reads from “key player” | Bin coverage (estimated by CB-Cl) | Bin coverage measured (from read mapping) | Bin size (estimated by CB-Cl) | Bin size measured (assembly length) | % Reads used in assembly of individual bins |
|---|---|---|---|---|---|---|---|
| Bin I | 0.2 % | 91.3 % | 581 | 656 | 6.8 Mbp | 6.5 Mbp | 91.3 % |
| Bin II | 0.5 % | 1.3 % | 317 | 306 | 7.0 Mbp | 8.9 Mbp | 86.3 % |
| Bin III | 1.6 % | 0.0 % | 140 | 152 | 20.4 Mbp | 21.2 Mbp | 86.2 % |
| Bin IV | 49.8 % | 0.0 % | 47 | 44 | 38.9 Mbp | 37.6 Mbp | 71.4 % |
| Bin V | 37.7 % | 0.0 % | 14 | 17 | 117 Mbp | 39 Mbp | 30.4 % |
Raw reads generated from a poly-aromatic hydrocarbon degrading enrichment culture [18] were, on the one hand, assembled globally (with the ALLPATHS program [20]) and, on the other hand, segregated into 5 abundance classes (bins) followed by targeted assembly of the reads from each class. The assembly of reads with the highest coverage (Bin I) led to the reconstruction of a single 6.5 Mbp genome (key player), which is missing from the global metagenome assembly.
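The coverage and size columns of the table above are linked by a standard back-of-the-envelope relation: the total sequence length in a bin is roughly the number of sequenced bases divided by the coverage. The sketch below shows that relation only; how the module actually derives its estimates is not stated in this excerpt, and the function names and numbers are hypothetical.

```python
def estimate_bin_size(n_reads, read_length, coverage):
    """Approximate total genome length (bp) in a bin as sequenced bases / coverage."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return n_reads * read_length / coverage

def measured_coverage(mapped_bases, assembly_length):
    """Coverage observed after read mapping: mapped bases / assembly length."""
    if assembly_length <= 0:
        raise ValueError("assembly length must be positive")
    return mapped_bases / assembly_length

# Hypothetical numbers, not taken from the table:
# 1,000,000 reads of 100 bp at 50x coverage correspond to about 2 Mbp.
size_bp = estimate_bin_size(n_reads=1_000_000, read_length=100, coverage=50)
coverage = measured_coverage(mapped_bases=100_000_000, assembly_length=2_000_000)
```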
Fig. 4 Evaluation of individual clustering modules and their chaining. (a) Comparison of the coverage-based clustering module (AB-Cl) versus AbundanceBin [8]. Groups of six bars for each sample of increasing complexity represent homogeneity, completeness and V1-measure (measured on the left axis) for AbundanceBin (red bars) and AB-Cl (green bars). Dotted lines denote execution time normalized per core (in logarithmic scale, on the right axis). Missing points result from AbundanceBin failing to process the datasets with 100 or more distinct genomes. (b) Comparison of the composition-based module (CB-Cl) versus MetaCluster [14]. Groups of six bars for each sample of increasing complexity represent homogeneity, completeness and V1-measure (measured on the left axis) for MetaCluster (red bars) and CB-Cl (green bars). Dotted lines denote execution time normalized per core (in logarithmic scale, on the right axis). Missing points result from MetaCluster failing to process the datasets with more than 500 distinct genomes. (c) Evaluation of the integrated two-level (2L) pipeline. The AB-Cl module was used for first-level clustering, followed by either the CB-Cl module (green) or MetaCluster (red) for second-level clustering. Dotted lines denote execution time normalized per core (in logarithmic scale, on the right axis). Missing points for the last dataset (700 distinct genomes) are due to the MetaCluster computation failing to complete.
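For readers unfamiliar with the three scores in this caption, the following sketch computes homogeneity, completeness and their harmonic mean (the V-measure with equal weighting, presumably what "V1-measure" denotes) from true genome labels and predicted cluster labels. It follows the standard entropy-based definitions and is written from scratch for illustration; it is not the evaluation code used by the authors.

```python
import math
from collections import Counter

def _entropy(counts, total):
    """Shannon entropy (in nats) of a distribution given as raw counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def v_measure(true_labels, cluster_labels):
    """Return (homogeneity, completeness, V-measure) for one clustering."""
    n = len(true_labels)
    if n == 0 or n != len(cluster_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    class_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_labels)
    joint = Counter(zip(true_labels, cluster_labels))

    h_class = _entropy(class_counts.values(), n)
    h_cluster = _entropy(cluster_counts.values(), n)
    # Conditional entropies H(class | cluster) and H(cluster | class).
    h_class_given_cluster = -sum(
        (n_ck / n) * math.log(n_ck / cluster_counts[k]) for (c, k), n_ck in joint.items()
    )
    h_cluster_given_class = -sum(
        (n_ck / n) * math.log(n_ck / class_counts[c]) for (c, k), n_ck in joint.items()
    )

    homogeneity = 1.0 if h_class == 0 else 1.0 - h_class_given_cluster / h_class
    completeness = 1.0 if h_cluster == 0 else 1.0 - h_cluster_given_class / h_cluster
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v

# Toy example: four reads from two genomes.
perfect = v_measure([0, 0, 1, 1], ["a", "a", "b", "b"])
oversplit = v_measure([0, 0, 1, 1], [0, 1, 2, 3])
```

Putting every read in its own cluster keeps homogeneity at 1 but halves completeness in this toy case, which is why the caption reports both scores alongside their combination.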
Fig. 5 Evaluation of final third-level (3L) clusters. (a) Third-level clusters’ content in target (pathogen) genome sequences. 3L-clusters (labelled with their identifier on the x axis) are sorted according to their size (measured on the left axis, blue line). Red peaks show the fraction of total pathogen reads embedded in each 3L-cluster (measured on the right axis). (b) Relevance of final third-level clusters to disease status. 3L-cluster coordinates on the x axis are the same as in Fig. 5a; purple peaks represent information gain (IG) for each 3L-cluster with respect to sick versus healthy class assignments (see Methods). Note that only six 3L-clusters have IG values above background: the three highest red peaks (representing 3L-clusters embodying the bulk of the pathogen genome) correspond to the three highest IG purple peaks (first, second and last peaks); the remaining high IG peaks are correlated with the opposite (i.e., healthy) label, and are devoid of sequences from the pathogen strain.
Fig. 6 Evaluation of clusters of contigs from the STEC outbreak microbiomes (see main text). (Left panel) Clusters’ content in pathogen (E. coli O104:H4) genome sequence. Peaks show the fraction of the E. coli O104:H4 genome embedded in each cluster computed by the CB-Cl module (k-mer size = 6, 300 output clusters (filtering out one dubious cluster containing more than 10 % of the original sequences); clusters are arbitrarily ordered on the x axis). (Right panel) Strength of the association between clusters and disease status. Cluster coordinates on the x axis are the same as in the left panel; peaks represent mutual information for each cluster with respect to the infection status of the microbiome donors (see main text). The cluster with the highest mutual information value encompasses about 70 % of the E. coli O104:H4 genome sequence.
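The information gain and mutual information scores mentioned in the two captions above measure the same quantity for a discrete cluster feature: how much knowing the feature reduces uncertainty about the sick/healthy label. The sketch below assumes, purely for illustration, that each cluster is reduced to a presence/absence flag per sample; the discretization actually used by the authors is described in their Methods and main text, and the function names here are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG(labels; feature) = H(labels) - H(labels | feature)."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Toy cohort of four samples (assumed presence/absence of one cluster per sample).
status = ["sick", "sick", "healthy", "healthy"]
ig_informative = information_gain([True, True, False, False], status)
ig_uninformative = information_gain([True, False, True, False], status)
```

A cluster present only in samples from sick donors reaches the maximum of 1 bit in this balanced toy cohort, while a cluster spread evenly across both groups scores 0.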
Reconstruction of individual genomes from a low-complexity pyrene-degrading bacterial consortium using the coverage-based clustering module (AB-Cl) for sequence pre-assembly
| Abundance class (Bin #) | Number of classified reads | Bin coverage estimate | % Binned reads mapped on reconstructed genome | % Total reads mapped on reconstructed genome | Genomes in given bin (% binned sequences mapped to it) |
|---|---|---|---|---|---|
| Bin I | 168591494 | 1389 | 82 % | 76.52 % | Bordetella (82 %) |
| Bin II | 9445028 | 51 | 97 % | 5.12 % | Mycobacterium (97 %) |
| Bin III | 1870808 | 19 | 98 % | 1.04 % | Stenotrophomonas (72 %), Sphingopyxis (22 %), Mycobacterium (3 %) |
| Bin IV | 1331276 | 11 | 98 % | 0.83 % | Sphingopyxis (76 %), Stenotrophomonas (19 %), Bordetella (3 %) |
The completeness of the reconstructed genomes was assessed using lineage-specific marker genes with CheckM [19], and yielded completeness estimates ranging from 97 to 99 %, with less than 2 % contamination and negligible strain-level heterogeneity. Derived from Adam I.K. et al., under review.