Cedric C Laczny, Christina Kiefer, Valentina Galata, Tobias Fehlmann, Christina Backes, Andreas Keller.
Abstract
Metagenomics-based studies of mixed microbial communities are impacting biotechnology, life sciences and medicine. Computational binning of metagenomic data is a powerful approach for the culture-independent recovery of population-resolved genomic sequences, i.e. from individual or closely related, constituent microorganisms. Existing binning solutions often require a priori characterized reference genomes and/or dedicated compute resources. Extending currently available reference-independent binning tools, we developed the BusyBee Web server for the automated deconvolution of metagenomic data into population-level genomic bins using assembled contigs (Illumina) or long reads (Pacific Biosciences, Oxford Nanopore Technologies). A reversible compression step as well as bootstrapped supervised binning enable quick turnaround times. The binning results are represented in interactive 2D scatterplots. Moreover, bin quality estimates, taxonomic annotations and annotations of antibiotic resistance genes are computed and visualized. Ground truth-based benchmarks of BusyBee Web demonstrate comparably high performance to state-of-the-art binning solutions for assembled contigs and markedly improved performance for long reads (median F1 scores: 70.02-95.21%). Furthermore, the applicability to real-world metagenomic datasets is shown. In conclusion, our reference-independent approach automatically bins assembled contigs or long reads, exhibits high sensitivity and precision, enables intuitive inspection of the results, and only requires FASTA-formatted input. The web-based application is freely accessible at: https://ccb-microbe.cs.uni-saarland.de/busybee.
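The F1 score quoted in the abstract is the harmonic mean of precision and sensitivity. A minimal Python sketch of that calculation follows. The helper names and the base-pair-based counting per bin are assumptions made for illustration; this record does not state how the benchmark derived precision and sensitivity.

```python
from statistics import median


def f1_score(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity (recall)."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


def bin_f1(true_positive_bp: int, bin_bp: int, genome_bp: int) -> float:
    """F1 of one bin against one reference genome (hypothetical helper).

    Assumption: all quantities are counted in base pairs.
      true_positive_bp: bases in the bin that belong to the genome
      bin_bp:           total bases assigned to the bin
      genome_bp:        total bases of the genome in the dataset
    """
    if bin_bp <= 0 or genome_bp <= 0:
        raise ValueError("bin_bp and genome_bp must be positive")
    precision = true_positive_bp / bin_bp
    sensitivity = true_positive_bp / genome_bp
    return f1_score(precision, sensitivity)


if __name__ == "__main__":
    # Made-up numbers, only to show how a median over bins would be taken.
    scores = [bin_f1(80, 100, 160), bin_f1(90, 100, 100)]
    print(f"median F1: {median(scores):.4f}")
```

Because F1 is a harmonic mean, a bin with high precision but low sensitivity (or the reverse) still scores low, which is why it is a common single-number summary for binning quality.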
Year: 2017 PMID: 28472498 PMCID: PMC5570254 DOI: 10.1093/nar/gkx348
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Overview of individual components of the BusyBee Web results page. (A) Input sequences are represented as individual points (according to the thresholds tb and tc) in the 2D scatterplot. Convex hulls (black polygons) delineate the predicted clusters. If the optional taxonomic and functional annotations were enabled, taxon and antibiotic resistance-related information is shown to the right of the scatterplot. Individual clusters, bins or taxa can be shown or hidden, and sequences encoding specific antibiotic resistance genes can be highlighted using points of larger size and dark color, shown here for the vanB gene. A left-click on a point reveals detailed information about the respective sequence, e.g. the taxonomic lineage or encoded antibiotic resistance genes. The user can pan and zoom the plot using the mouse, e.g. to focus on a region of interest, and point sizes can be adjusted using sliders below the 2D scatterplot. (B) Bin quality estimates (completeness, contamination, strain heterogeneity) are provided as a sortable table, here sorted by decreasing completeness. An excerpt representing the five most complete bins is shown. (C) The optional taxonomic compositions of the clusters/bins are shown as stacked bar charts. The taxonomic rank, e.g. genus, can be selected and a second chart can be shown to compare the compositions of the individual clusters/bins at different ranks, e.g. genus versus family.
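The convex hulls in panel (A) enclose the 2D points of each predicted cluster. As an illustration of how such a hull can be computed, the following Python sketch uses Andrew's monotone chain algorithm. This is a generic textbook method chosen for the example; the record does not say which hull implementation BusyBee Web uses, and the function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    # Build the lower hull from left to right.
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    # Build the upper hull from right to left.
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


if __name__ == "__main__":
    cluster = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
    print(convex_hull(cluster))
```

Interior points such as `(1, 1)` in the example are dropped, so only the outline of the cluster remains for drawing the polygon.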
Figure 2. Screenshots of the interactive scatterplots for (A) ground truth-based Illumina (Shakya2013), (B) ground truth-based ONT, (C) small-scale Illumina and (D) PacBio metagenomic data. (A) A compression of 1 (‘1NN’) as well as sequence chunks (3 kbp chunk length) derived from the full-length contigs were used. (B) Only sequences with species-level taxonomic assignments are shown. (C) Sequences encoding class A CTX-M beta-lactamases (CTXM-RF0059) are highlighted. (D) A compression of 1 (‘1NN’) was used. The convex hulls (black polygons) delineate the individual sequence clusters. Descriptions at the top of each plot represent job names; if none is specified, a unique job ID is shown. Colors are based on species-level taxonomic assignments.
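Panel (A) uses 3 kbp chunks derived from full-length contigs. One way to derive such chunks is sketched below in Python. The function name is hypothetical, and keeping a trailing remainder shorter than the chunk length as its own chunk is an assumption; the record does not describe how BusyBee Web treats remainders.

```python
from typing import List


def chunk_sequence(seq: str, chunk_len: int = 3000) -> List[str]:
    """Split a sequence into consecutive, non-overlapping chunks.

    Assumption: a final piece shorter than chunk_len is kept as-is.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    return [seq[i:i + chunk_len] for i in range(0, len(seq), chunk_len)]


if __name__ == "__main__":
    contig = "ACGT" * 1750  # a made-up 7 000 bp contig
    print([len(c) for c in chunk_sequence(contig)])
```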
BusyBee Web runtimes, in minutes, for the ground truth and real-world datasets studied here
| Dataset type | Dataset | # sequences | Total length [bp] | Binning runtime [min] | Total runtime [min] |
|---|---|---|---|---|---|
| Ground truth | Shakya2013 | 24 974 | 179 063 212 | 8 | 30 |
| Ground truth | Gregor2016 | 14 393 | 142 556 476 | 6 | 23 |
| Ground truth | ONT | 21 000 | 97 715 136 | 11 | 20 |
| Real-world | Small-scale Illumina | 859 | 50 964 782 | 1 | 6 |
| Real-world | Large-scale Illumina‡ | 133 149 | 399 132 179 | 28 | 75 |
| Real-world | PacBio† | 71 029 | 93 937 106 | 18 | 27 |
Runtimes were determined manually based on the progress interface in the browser and were rounded to the next full minute. The minimum sequence length threshold was 1 kbp for the large-scale Illumina dataset and 500 bp for the other datasets.
†Compression of 1 was used.
‡Compression of 2 was used.
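The minimum sequence length threshold mentioned in the table notes (500 bp, or 1 kbp for the large-scale Illumina dataset) amounts to a length filter on the FASTA input. A minimal Python sketch of such a filter follows. The function names are hypothetical, and treating the threshold as inclusive (sequences of exactly the threshold length are kept) is an assumption, since the record does not say.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    parts: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def filter_by_length(records: Iterable[Record], min_len: int = 500) -> List[Record]:
    """Keep records whose sequence is at least min_len bases long.

    Assumption: the threshold is inclusive.
    """
    return [(h, s) for h, s in records if len(s) >= min_len]


if __name__ == "__main__":
    fasta = [">contig_1", "ACGT" * 200, ">contig_2", "ACGT" * 50]
    kept = filter_by_length(read_fasta(fasta), min_len=500)
    print([h for h, _ in kept])
```

To filter a file instead of an in-memory list, an open file handle can be passed to `read_fasta`, since it accepts any iterable of lines.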